This commit is contained in:
louiscklaw
2025-01-31 22:21:55 +08:00
parent 3688f9ee24
commit ae3970ff3c
90 changed files with 734370 additions and 0 deletions

Binary file not shown.

@@ -0,0 +1,249 @@
// Databricks notebook source
// STARTER CODE - DO NOT EDIT THIS CELL
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
import org.apache.spark.sql.expressions.Window
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
val customSchema = StructType(Array(StructField("lpep_pickup_datetime", StringType, true), StructField("lpep_dropoff_datetime", StringType, true), StructField("PULocationID", IntegerType, true), StructField("DOLocationID", IntegerType, true), StructField("passenger_count", IntegerType, true), StructField("trip_distance", FloatType, true), StructField("fare_amount", FloatType, true), StructField("payment_type", IntegerType, true)))
// COMMAND ----------
// STARTER CODE - YOU CAN LOAD ANY FILE WITH A SIMILAR SYNTAX.
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("nullValue", "null")
.schema(customSchema)
.load("/FileStore/tables/nyc_tripdata.csv") // the csv file which you want to work with
.withColumn("pickup_datetime", from_unixtime(unix_timestamp(col("lpep_pickup_datetime"), "MM/dd/yyyy HH:mm")))
.withColumn("dropoff_datetime", from_unixtime(unix_timestamp(col("lpep_dropoff_datetime"), "MM/dd/yyyy HH:mm")))
.drop($"lpep_pickup_datetime")
.drop($"lpep_dropoff_datetime")
// COMMAND ----------
// LOAD THE "taxi_zone_lookup.csv" FILE SIMILARLY AS ABOVE. CAST ANY COLUMN TO APPROPRIATE DATA TYPE IF NECESSARY.
val lookupSchema = StructType(
  Array(
    StructField("LocationID", IntegerType, true),
    StructField("Borough", StringType, true),
    StructField("Zone", StringType, true),
    StructField("service_zone", StringType, true)
  )
)
val df2 = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("nullValue", "null")
.schema(lookupSchema)
.load("/FileStore/tables/taxi_zone_lookup.csv") // the csv file which you want to work with
// ENTER THE CODE BELOW
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
// Some commands that you can use to see your dataframes and the results of the operations. You can comment out display(df) and uncomment df.show(5) to see the data differently. You will find these two functions useful in reporting your results.
display(df)
// df.show(5) // view the first 5 rows of the dataframe
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
// Filter the data to only keep the rows where "PULocationID" and the "DOLocationID" are different and the "trip_distance" is strictly greater than 2.0 (>2.0).
// VERY VERY IMPORTANT: ALL THE SUBSEQUENT OPERATIONS MUST BE PERFORMED ON THIS FILTERED DATA
val df_filter = df.filter($"PULocationID" =!= $"DOLocationID" && $"trip_distance" > 2.0)
df_filter.show(5)
// COMMAND ----------
// PART 1a: List the top-5 most popular locations for dropoff based on "DOLocationID", sorted in descending order by popularity. If there is a tie, then the one with the lower "DOLocationID" gets listed first
// Output Schema: DOLocationID int, number_of_dropoffs int
// Hint: Checkout the groupBy(), orderBy() and count() functions.
// ENTER THE CODE BELOW
val dropoff = df_filter.groupBy("DOLocationID").count()
.orderBy($"count".desc, $"DOLocationID".asc)
.withColumnRenamed("count", "number_of_dropoffs")
display(dropoff.limit(5))
// COMMAND ----------
// PART 1b: List the top-5 most popular locations for pickup based on "PULocationID", sorted in descending order by popularity. If there is a tie, then the one with the lower "PULocationID" gets listed first.
// Output Schema: PULocationID int, number_of_pickups int
// Hint: Code is very similar to part 1a above.
// ENTER THE CODE BELOW
val pickup = df_filter.groupBy("PULocationID").count()
.orderBy($"count".desc, $"PULocationID".asc)
.withColumnRenamed("count", "number_of_pickups")
display(pickup.limit(5))
// COMMAND ----------
// PART 2: List the top-3 locationIDs with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pickups and all dropoffs at that LocationID. In case of a tie, the lower LocationID gets listed first.
//Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.
// Output Schema: LocationID int, number_activities int
// Hint: In order to get the result, you may need to perform a join operation between the two dataframes that you created in earlier parts (to come up with the sum of the number of pickups and dropoffs on each location).
// ENTER THE CODE BELOW
val PU = df_filter.select("PULocationID").distinct()
val DO = df_filter.select("DOLocationID").distinct()
val locations = PU.union(DO).distinct()
val temp_df = locations
.join(pickup, locations.col("PULocationID") === pickup.col("PULocationID"), "left")
.join(dropoff, locations.col("PULocationID") === dropoff.col("DOLocationID"), "left")
// A location may have only pickups or only dropoffs: coalesce the missing side to 0 so the sum is never null,
// and take the LocationID from `locations` (the joined DOLocationID is null for pickup-only locations).
val final_res = temp_df.withColumn("number_activities", coalesce(col("number_of_pickups"), lit(0)) + coalesce(col("number_of_dropoffs"), lit(0)))
val total_activity = final_res
.select(locations.col("PULocationID").as("LocationID"), col("number_activities"))
.orderBy($"number_activities".desc, $"LocationID".asc)
display(total_activity.limit(3))
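A subtlety in the join above: a location that appears only as a pickup point (or only as a dropoff point) has a null on the other side of the left join, which would make the sum null. A toy sketch in plain Scala (hypothetical counts, no Spark needed) of the same merge-and-sum with the missing side treated as 0:

```scala
// Toy sketch with hypothetical per-location counts: merging pickup and
// dropoff counts, treating a missing side as 0 -- the same effect
// coalesce(col, lit(0)) gives after the left joins.
val pickups  = Map(74 -> 3, 61 -> 1)
val dropoffs = Map(61 -> 5, 42 -> 2)
val totalActivity = (pickups.keySet ++ dropoffs.keySet).map { id =>
  id -> (pickups.getOrElse(id, 0) + dropoffs.getOrElse(id, 0))
}.toMap
// totalActivity(74) == 3, totalActivity(61) == 6, totalActivity(42) == 2
```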
// COMMAND ----------
// PART 3: List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx along with "Unknown" and "EWR") and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens).
// Output Schema: Borough string, total_number_activities int
// Hint: You can use the dataframe obtained from the previous part, and will need to do the join with the 'taxi_zone_lookup' dataframe. Also, checkout the "agg" function applied to a grouped dataframe.
// ENTER THE CODE BELOW
val df_boro = total_activity.join(df2, Seq("LocationID"), "inner")
val final_res = df_boro.groupBy("Borough")
.agg(sum("number_activities"))
.withColumnRenamed("sum(number_activities)","total_number_activities")
.orderBy($"total_number_activities".desc)
display(final_res)
// COMMAND ----------
// PART 4: List the top 2 days of week with the largest number of daily average pickups, along with the average number of pickups on each of the 2 days in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are all Mondays, so the average pick-ups for these is the sum of the pickups on each date divided by 3.
//Note: The day of week is a string of the days full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format: yyyy-mm-dd.
// Output Schema: day_of_week string, avg_count float
// Hint: You may need to group by the "date" (without time stamp - time in the day) first. Checkout "to_date" function.
// ENTER THE CODE BELOW
val df_day = df_filter
.withColumn("day_of_week", date_format(col("pickup_datetime"), "EEEE"))
.withColumn("day", to_date(col("pickup_datetime")))
val df_aggr = df_day.groupBy("day_of_week","day")
.agg(count("PULocationID"))
.withColumnRenamed("count(PULocationID)", "pu_count")
val df_avg = df_aggr.groupBy("day_of_week")
.agg(avg("pu_count"))
.withColumnRenamed("avg(pu_count)", "avg_count")
.orderBy($"avg_count".desc)
.limit(2)
display(df_avg)
// COMMAND ----------
// PART 5: For each hour of a day (0 to 23, 0 being midnight) - in the order from 0 to 23(inclusively), find the zone in the Brooklyn borough with the LARGEST number of total pickups.
//Note: All dates for each hour should be included.
// Output Schema: hour_of_day int, zone string, max_count int
// Hint: You may need to use "Window" over hour of day, along with "group by" to find the MAXIMUM count of pickups
// ENTER THE CODE BELOW
val brooklyn = df2.filter(col("Borough")==="Brooklyn")
val df_join = df_filter.join(brooklyn, df_filter.col("PULocationID")===brooklyn.col("LocationID"))
val df_time = df_join.withColumn("hour_of_day", hour(col("pickup_datetime")))
//sum number of pickups, group by hour and zone
val df_group = df_time.groupBy("hour_of_day", "Zone")
.agg(count("PULocationID"))
.withColumnRenamed("count(PULocationID)", "count")
val window_val = Window.partitionBy("hour_of_day")
.orderBy(col("count").desc, col("Zone").asc) // break count ties deterministically on zone name
val final_df = df_group.withColumn("rn", row_number().over(window_val))
.where(col("rn") === 1)
.withColumnRenamed("count", "max_count")
.select("hour_of_day", "Zone", "max_count")
.orderBy(col("hour_of_day"))
display(final_df)
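The window step above keeps, per hour, only the row ranked first by pickup count. The same selection can be sketched in plain Scala on hypothetical rows, which makes the `row_number() ... where rn === 1` pattern easy to see:

```scala
// Hypothetical (hour, zone, pickupCount) rows: per hour, keep the zone with
// the largest count -- what row_number() over the hour-partitioned window
// achieves when only rn == 1 is retained.
val rows = Seq(
  (0, "Williamsburg (North Side)", 569),
  (0, "DUMBO/Vinegar Hill", 300),
  (1, "Fort Greene", 120)
)
val topPerHour = rows.groupBy(_._1)            // partition by hour
  .map { case (_, rs) => rs.maxBy(_._3) }      // top row per partition
  .toList.sortBy(_._1)                         // order hours 0..23
// topPerHour: List((0, "Williamsburg (North Side)", 569), (1, "Fort Greene", 120))
```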
// COMMAND ----------
// PART 6 - Find which 3 different days in the month of January, in Manhattan, saw the largest positive percentage increase in pick-ups compared to the previous day, in the order from largest percentage increase to smallest percentage increase
// Note: All years need to be aggregated to calculate the pickups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.
// Output Schema: day int, percent_change float
// Hint: You might need to use lag function, over a window ordered by day of month.
// ENTER THE CODE BELOW
val mnhtn = df2.filter(col("Borough")==="Manhattan")
val df_join = df_filter.join(mnhtn, df_filter.col("PULocationID")===mnhtn.col("LocationID"))
val df_day = df_join.withColumn("month", month(col("pickup_datetime")))
.withColumn("date", to_date(col("pickup_datetime")))
.withColumn("year", year(col("pickup_datetime")))
.filter(col("month")===1 && col("year")===2019)
val df_grouped = df_day.groupBy("date")
.agg(count("PULocationID"))
.withColumnRenamed("count(PULocationID)", "count")
.orderBy(col("date"))
val w = Window.orderBy("date")
val df_delta = df_grouped
// lag() is null on the first date in the window, so Jan 1 gets no percent_change
// and the Dec 31 -> Jan 1 change is excluded automatically
.withColumn("percent_change", round((col("count") - lag("count", 1).over(w)) / lag("count", 1).over(w) * 100, 2))
.withColumn("day", dayofmonth(col("date"))) // day of month as an int, per the output schema
.orderBy(col("percent_change").desc)
display(df_delta.select("day", "percent_change").limit(3))
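The lag-over-window step computes a day-over-day percent change. A minimal plain-Scala sketch of the same arithmetic on hypothetical daily counts, pairing each day with the previous one the way `lag("count", 1)` does over a date-ordered window:

```scala
// Hypothetical (day, pickupCount) pairs; sliding(2) pairs each day with the
// previous one, mirroring lag("count", 1) over a window ordered by date.
val daily = Seq((1, 200), (2, 257), (3, 240))
val changes = daily.sliding(2).collect { case Seq((_, prev), (d, cur)) =>
  // percent change vs. the previous day, rounded to 2 decimal places
  d -> math.round((cur - prev).toDouble / prev * 10000) / 100.0
}.toList
// changes: List((2, 28.5), (3, -6.61))
```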

@@ -0,0 +1,63 @@
Part 1a: top-5 most popular drop locations,,
DOLocationID,number_of_dropoffs,
61,5937,
138,5146,
239,4133,
244,4006,
42,3859,
Part 1b: top-5 most popular pickup locations,,
PULocationID,number_of_pickups,
74,17360,
75,13299,
244,9958,
41,9645,
82,9306,
Part 2: top-3 locations with the maximum overall activity,,
LocationID,number_activities,
74,20292,
75,16326,
244,13964,
Part 3: all the boroughs in the order of having the highest to lowest number of activities,,
Borough,total_number_activities,
Brooklyn,198506,
Manhattan,175953,
Queens,157633,
Bronx,67707,
Unknown,1215,
Staten Island,888,
EWR,104,
Part 4: top 2 days of week with the largest number of (daily) average pickups - along with the values of average number of pickups on each of the two days,,
day_of_week,avg_count,
Wednesday,10257.6,
Saturday,9884.75,
Part 5: For each particular hour of a day (0 to 23 - 0 being midnight) - in their order from 0 to 23. Find the zone in Brooklyn borough with the LARGEST number of pickups,,
hour_of_day,Zone,max_count
0,Williamsburg (North Side),569
1,Williamsburg (North Side),460
2,Williamsburg (North Side),429
3,Williamsburg (North Side),357
4,Williamsburg (North Side),228
5,East Williamsburg,89
6,East New York,149
7,Brooklyn Heights,307
8,Brooklyn Heights,511
9,Brooklyn Heights,574
10,Brooklyn Heights,502
11,Brooklyn Heights,563
12,Brooklyn Heights,491
13,Brooklyn Heights,472
14,Fort Greene,511
15,Fort Greene,559
16,Fort Greene,658
17,Downtown Brooklyn/MetroTech,651
18,Fort Greene,736
19,Fort Greene,647
20,Fort Greene,558
21,Fort Greene,556
22,DUMBO/Vinegar Hill,611
23,Williamsburg (North Side),613
Part 6: Which 3 different days of January - in Manhattan - saw the largest percentage increase in pickups compared to the previous day,,
day,percent_change,
22,51.06,
2,28.38,
28,22.94,

File diff suppressed because one or more lines are too long