This commit is contained in:
louiscklaw
2025-01-31 22:21:55 +08:00
parent 3688f9ee24
commit ae3970ff3c
90 changed files with 734370 additions and 0 deletions

Binary file not shown.

@@ -0,0 +1,249 @@
// Databricks notebook source
// STARTER CODE - DO NOT EDIT THIS CELL
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
import org.apache.spark.sql.expressions.Window
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
val customSchema = StructType(Array(StructField("lpep_pickup_datetime", StringType, true), StructField("lpep_dropoff_datetime", StringType, true), StructField("PULocationID", IntegerType, true), StructField("DOLocationID", IntegerType, true), StructField("passenger_count", IntegerType, true), StructField("trip_distance", FloatType, true), StructField("fare_amount", FloatType, true), StructField("payment_type", IntegerType, true)))
// COMMAND ----------
// STARTER CODE - YOU CAN LOAD ANY FILE WITH A SIMILAR SYNTAX.
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("nullValue", "null")
.schema(customSchema)
.load("/FileStore/tables/nyc_tripdata.csv") // the csv file which you want to work with
.withColumn("pickup_datetime", from_unixtime(unix_timestamp(col("lpep_pickup_datetime"), "MM/dd/yyyy HH:mm")))
.withColumn("dropoff_datetime", from_unixtime(unix_timestamp(col("lpep_dropoff_datetime"), "MM/dd/yyyy HH:mm")))
.drop($"lpep_pickup_datetime")
.drop($"lpep_dropoff_datetime")
// COMMAND ----------
// LOAD THE "taxi_zone_lookup.csv" FILE SIMILARLY AS ABOVE. CAST ANY COLUMN TO APPROPRIATE DATA TYPE IF NECESSARY.
val lookupSchema = StructType(
  Array(
    StructField("LocationID", IntegerType, true),
    StructField("Borough", StringType, true),
    StructField("Zone", StringType, true),
    StructField("service_zone", StringType, true)
  )
)
val df2 = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("nullValue", "null")
.schema(lookupSchema)
.load("/FileStore/tables/taxi_zone_lookup.csv") // the csv file which you want to work with
// ENTER THE CODE BELOW
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
// Some commands that you can use to see your dataframes and the results of the operations. You can comment out display(df) and uncomment df.show(5) to see the data differently. You will find these two functions useful in reporting your results.
display(df)
// df.show(5) // view the first 5 rows of the dataframe
// COMMAND ----------
// STARTER CODE - DO NOT EDIT THIS CELL
// Filter the data to only keep the rows where "PULocationID" and the "DOLocationID" are different and the "trip_distance" is strictly greater than 2.0 (>2.0).
// VERY VERY IMPORTANT: ALL THE SUBSEQUENT OPERATIONS MUST BE PERFORMED ON THIS FILTERED DATA
val df_filter = df.filter($"PULocationID" =!= $"DOLocationID" && $"trip_distance" > 2.0)
df_filter.show(5)
// COMMAND ----------
// PART 1a: List the top-5 most popular locations for dropoff based on "DOLocationID", sorted in descending order by popularity. If there is a tie, then the one with the lower "DOLocationID" gets listed first
// Output Schema: DOLocationID int, number_of_dropoffs int
// Hint: Checkout the groupBy(), orderBy() and count() functions.
// ENTER THE CODE BELOW
val dropoff = df_filter.groupBy("DOLocationID").count()
.orderBy($"count".desc, $"DOLocationID".asc)
.withColumnRenamed("count", "number_of_dropoffs")
display(dropoff.limit(5))
// COMMAND ----------
// PART 1b: List the top-5 most popular locations for pickup based on "PULocationID", sorted in descending order by popularity. If there is a tie, then the one with the lower "PULocationID" gets listed first.
// Output Schema: PULocationID int, number_of_pickups int
// Hint: Code is very similar to part 1a above.
// ENTER THE CODE BELOW
val pickup = df_filter.groupBy("PULocationID").count()
.orderBy($"count".desc, $"PULocationID".asc)
.withColumnRenamed("count", "number_of_pickups")
display(pickup.limit(5))
// COMMAND ----------
// PART 2: List the top-3 locationIDs with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pickups and all dropoffs at that LocationID. In case of a tie, the lower LocationID gets listed first.
//Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.
// Output Schema: LocationID int, number_activities int
// Hint: In order to get the result, you may need to perform a join operation between the two dataframes that you created in earlier parts (to come up with the sum of the number of pickups and dropoffs on each location).
// ENTER THE CODE BELOW
val PU = df_filter.select("PULocationID").distinct()
val DO = df_filter.select("DOLocationID").distinct()
val locations = PU.union(DO).distinct()
val temp_df = locations
.join(pickup, locations.col("PULocationID") === pickup.col("PULocationID"), "left")
.join(dropoff, locations.col("PULocationID") === dropoff.col("DOLocationID"), "left")
// A location may have only pickups or only dropoffs: coalesce the missing side to 0 so the sum is never null,
// and take the LocationID from `locations` (the joined DOLocationID is null for pickup-only locations).
val final_res = temp_df.withColumn("number_activities", coalesce(col("number_of_pickups"), lit(0)) + coalesce(col("number_of_dropoffs"), lit(0)))
val total_activity = final_res
.select(locations.col("PULocationID").as("LocationID"), col("number_activities"))
.orderBy($"number_activities".desc, $"LocationID".asc)
display(total_activity.limit(3))
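A subtlety in the join above: a location that appears only as a pickup point (or only as a dropoff point) has a null on the other side of the left join, which would make the sum null. A toy sketch in plain Scala (hypothetical counts, no Spark needed) of the same merge-and-sum with the missing side treated as 0:

```scala
// Toy sketch with hypothetical per-location counts: merging pickup and
// dropoff counts, treating a missing side as 0 -- the same effect
// coalesce(col, lit(0)) gives after the left joins.
val pickups  = Map(74 -> 3, 61 -> 1)
val dropoffs = Map(61 -> 5, 42 -> 2)
val totalActivity = (pickups.keySet ++ dropoffs.keySet).map { id =>
  id -> (pickups.getOrElse(id, 0) + dropoffs.getOrElse(id, 0))
}.toMap
// totalActivity(74) == 3, totalActivity(61) == 6, totalActivity(42) == 2
```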
// COMMAND ----------
// PART 3: List all the boroughs (of NYC: Manhattan, Brooklyn, Queens, Staten Island, Bronx along with "Unknown" and "EWR") and their total number of activities, in descending order of total number of activities. Here, the total number of activities for a borough (e.g., Queens) is the sum of the overall activities (as defined in part 2) of all the LocationIDs that fall in that borough (Queens).
// Output Schema: Borough string, total_number_activities int
// Hint: You can use the dataframe obtained from the previous part, and will need to do the join with the 'taxi_zone_lookup' dataframe. Also, checkout the "agg" function applied to a grouped dataframe.
// ENTER THE CODE BELOW
val df_boro = total_activity.join(df2, Seq("LocationID"), "inner")
val final_res = df_boro.groupBy("Borough")
.agg(sum("number_activities"))
.withColumnRenamed("sum(number_activities)","total_number_activities")
.orderBy($"total_number_activities".desc)
display(final_res)
// COMMAND ----------
// PART 4: List the top 2 days of week with the largest number of daily average pickups, along with the average number of pickups on each of the 2 days in descending order (no rounding off required). Here, the average pickup is calculated by taking an average of the number of pick-ups on different dates falling on the same day of the week. For example, 02/01/2021, 02/08/2021 and 02/15/2021 are all Mondays, so the average pick-ups for these is the sum of the pickups on each date divided by 3.
//Note: The day of week is a string of the days full spelling, e.g., "Monday" instead of the number 1 or "Mon". Also, the pickup_datetime is in the format: yyyy-mm-dd.
// Output Schema: day_of_week string, avg_count float
// Hint: You may need to group by the "date" (without time stamp - time in the day) first. Checkout "to_date" function.
// ENTER THE CODE BELOW
val df_day = df_filter
.withColumn("day_of_week", date_format(col("pickup_datetime"), "EEEE"))
.withColumn("day", to_date(col("pickup_datetime")))
val df_aggr = df_day.groupBy("day_of_week","day")
.agg(count("PULocationID"))
.withColumnRenamed("count(PULocationID)", "pu_count")
val df_avg = df_aggr.groupBy("day_of_week")
.agg(avg("pu_count"))
.withColumnRenamed("avg(pu_count)", "avg_count")
.orderBy($"avg_count".desc)
.limit(2)
display(df_avg)
// COMMAND ----------
// PART 5: For each hour of a day (0 to 23, 0 being midnight) - in the order from 0 to 23(inclusively), find the zone in the Brooklyn borough with the LARGEST number of total pickups.
//Note: All dates for each hour should be included.
// Output Schema: hour_of_day int, zone string, max_count int
// Hint: You may need to use "Window" over hour of day, along with "group by" to find the MAXIMUM count of pickups
// ENTER THE CODE BELOW
val brooklyn = df2.filter(col("Borough")==="Brooklyn")
val df_join = df_filter.join(brooklyn, df_filter.col("PULocationID")===brooklyn.col("LocationID"))
val df_time = df_join.withColumn("hour_of_day", hour(col("pickup_datetime")))
//sum number of pickups, group by hour and zone
val df_group = df_time.groupBy("hour_of_day", "Zone")
.agg(count("PULocationID"))
.withColumnRenamed("count(PULocationID)", "count")
val window_val = Window.partitionBy("hour_of_day")
.orderBy(col("count").desc, col("Zone").asc) // break count ties deterministically on zone name
val final_df = df_group.withColumn("rn", row_number().over(window_val))
.where(col("rn") === 1)
.withColumnRenamed("count", "max_count")
.select("hour_of_day", "Zone", "max_count")
.orderBy(col("hour_of_day"))
display(final_df)
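The window step above keeps, per hour, only the row ranked first by pickup count. The same selection can be sketched in plain Scala on hypothetical rows, which makes the `row_number() ... where rn === 1` pattern easy to see:

```scala
// Hypothetical (hour, zone, pickupCount) rows: per hour, keep the zone with
// the largest count -- what row_number() over the hour-partitioned window
// achieves when only rn == 1 is retained.
val rows = Seq(
  (0, "Williamsburg (North Side)", 569),
  (0, "DUMBO/Vinegar Hill", 300),
  (1, "Fort Greene", 120)
)
val topPerHour = rows.groupBy(_._1)            // partition by hour
  .map { case (_, rs) => rs.maxBy(_._3) }      // top row per partition
  .toList.sortBy(_._1)                         // order hours 0..23
// topPerHour: List((0, "Williamsburg (North Side)", 569), (1, "Fort Greene", 120))
```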
// COMMAND ----------
// PART 6 - Find which 3 different days in the month of January, in Manhattan, saw the largest positive percentage increase in pick-ups compared to the previous day, in the order from largest percentage increase to smallest percentage increase
// Note: All years need to be aggregated to calculate the pickups for a specific day of January. The change from Dec 31 to Jan 1 can be excluded.
// Output Schema: day int, percent_change float
// Hint: You might need to use lag function, over a window ordered by day of month.
// ENTER THE CODE BELOW
val mnhtn = df2.filter(col("Borough")==="Manhattan")
val df_join = df_filter.join(mnhtn, df_filter.col("PULocationID")===mnhtn.col("LocationID"))
val df_day = df_join.withColumn("month", month(col("pickup_datetime")))
.withColumn("date", to_date(col("pickup_datetime")))
.withColumn("year", year(col("pickup_datetime")))
.filter(col("month")===1 && col("year")===2019)
val df_grouped = df_day.groupBy("date")
.agg(count("PULocationID"))
.withColumnRenamed("count(PULocationID)", "count")
.orderBy(col("date"))
val w = Window.orderBy("date")
val df_delta = df_grouped
// lag() is null on the first date in the window, so Jan 1 gets no percent_change
// and the Dec 31 -> Jan 1 change is excluded automatically
.withColumn("percent_change", round((col("count") - lag("count", 1).over(w)) / lag("count", 1).over(w) * 100, 2))
.withColumn("day", dayofmonth(col("date"))) // day of month as an int, per the output schema
.orderBy(col("percent_change").desc)
display(df_delta.select("day", "percent_change").limit(3))
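The lag-over-window step computes a day-over-day percent change. A minimal plain-Scala sketch of the same arithmetic on hypothetical daily counts, pairing each day with the previous one the way `lag("count", 1)` does over a date-ordered window:

```scala
// Hypothetical (day, pickupCount) pairs; sliding(2) pairs each day with the
// previous one, mirroring lag("count", 1) over a window ordered by date.
val daily = Seq((1, 200), (2, 257), (3, 240))
val changes = daily.sliding(2).collect { case Seq((_, prev), (d, cur)) =>
  // percent change vs. the previous day, rounded to 2 decimal places
  d -> math.round((cur - prev).toDouble / prev * 10000) / 100.0
}.toList
// changes: List((2, 28.5), (3, -6.61))
```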

@@ -0,0 +1,63 @@
Part 1a: top-5 most popular drop locations,,
DOLocationID,number_of_dropoffs,
61,5937,
138,5146,
239,4133,
244,4006,
42,3859,
Part 1b: top-5 most popular pickup locations,,
PULocationID,number_of_pickups,
74,17360,
75,13299,
244,9958,
41,9645,
82,9306,
Part 2: top-3 locations with the maximum overall activity,,
LocationID,number_activities,
74,20292,
75,16326,
244,13964,
Part 3: all the boroughs in the order of having the highest to lowest number of activities,,
Borough,total_number_activities,
Brooklyn,198506,
Manhattan,175953,
Queens,157633,
Bronx,67707,
Unknown,1215,
Staten Island,888,
EWR,104,
Part 4: top 2 days of week with the largest number of (daily) average pickups - along with the values of average number of pickups on each of the two days,,
day_of_week,avg_count,
Wednesday,10257.6,
Saturday,9884.75,
Part 5: For each particular hour of a day (0 to 23 - 0 being midnight) - in their order from 0 to 23. Find the zone in Brooklyn borough with the LARGEST number of pickups,,
hour_of_day,Zone,max_count
0,Williamsburg (North Side),569
1,Williamsburg (North Side),460
2,Williamsburg (North Side),429
3,Williamsburg (North Side),357
4,Williamsburg (North Side),228
5,East Williamsburg,89
6,East New York,149
7,Brooklyn Heights,307
8,Brooklyn Heights,511
9,Brooklyn Heights,574
10,Brooklyn Heights,502
11,Brooklyn Heights,563
12,Brooklyn Heights,491
13,Brooklyn Heights,472
14,Fort Greene,511
15,Fort Greene,559
16,Fort Greene,658
17,Downtown Brooklyn/MetroTech,651
18,Fort Greene,736
19,Fort Greene,647
20,Fort Greene,558
21,Fort Greene,556
22,DUMBO/Vinegar Hill,611
23,Williamsburg (North Side),613
Part 6: Which 3 different days of January - in Manhattan - saw the largest percentage increase in pickups compared to the previous day,,
day,percent_change,
22,51.06,
2,28.38,
28,22.94,

File diff suppressed because one or more lines are too long