# HW3 - Q1 [15 pts]



## Important Notices

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> add any cells to this Jupyter Notebook, because that will crash the autograder.
</div>


All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there are instructions about completing a task in this notebook, that task is not optional.



<div class="alert alert-block alert-info">
    You <strong>must</strong> implement the following functions in this notebook to receive credit.
</div>


`user()`

`clean_data()`

`common_pair()`

`time_of_cheapest_fare()`

`passenger_count_for_most_tip()`

`day_with_traffic()`



Each method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded. You may assume we will only use your code to work with data from the NYC-TLC dataset during auto-grading. 

### Helper functions

You are permitted to write additional helper functions, or use additional instance variables so long as the previously described functions work as required.

<div class="alert alert-block alert-danger">
    WARNING: Do <strong>NOT</strong> remove or modify the following utility functions:
</div>

`load_data()`

#### Pyspark Imports
<span style="color:red">*Please don't modify the below cell*</span>

In [1]:
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql.functions import hour, when, col, date_format, to_timestamp

#### Define Spark Context
<span style="color:red">*Please don't modify the below cell*</span>

In [2]:
sc = pyspark.SparkContext(appName="HW3-Q1")
sqlContext = SQLContext(sc)

#### Function to load data

<span style="color:red">*Please don't modify the below cell*</span>

In [3]:
def load_data():
    df = sqlContext.read.option("header",True).csv("yellow_tripdata_2019-01_short.csv")
    return df

In [4]:
df = load_data()
df.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: string (nullable = true)
 |-- DOLocationID: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
 |-- congestion_surcharge: string (nullable = true)



# Implement the functions below for this assignment:

## 1. Update the `user()` function
This function should return your GT username, e.g., gburdell3.

In [5]:
def user():
    """
    :return: string
    your GTUsername, NOT your 9-Digit GTId  
    """  
    return 'psrinivasan48'

## 1a. [1 pts] Casting the columns into correct types

To process the data accurately, cast the following columns to the given data types:

- `passenger_count` - integer 
- `total_amount` - float 
- `tip_amount` - float
- `trip_distance` - float 
- `fare_amount` - float 
- `tpep_pickup_datetime` - timestamp 
- `tpep_dropoff_datetime` - timestamp 

All of the columns in the original data should be returned with the above columns converted to the correct data type.
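
The schema printed earlier shows every column loaded as a string, which is why the casts matter. As a rough plain-Python analogy (not Spark code, and the sample values are made up for illustration), string values sort lexicographically rather than numerically:

```python
# Values read from a CSV without a declared schema arrive as strings.
raw_counts = ["10", "9", "2"]  # hypothetical passenger_count values

# Lexicographic order: "10" sorts before "2" because "1" < "2".
sorted_as_text = sorted(raw_counts)

# Numeric order after an explicit cast to int.
sorted_as_int = sorted(int(v) for v in raw_counts)

print(sorted_as_text)  # ['10', '2', '9']
print(sorted_as_int)   # [2, 9, 10]
```

The same kind of surprise can affect ordering and comparisons on uncast columns, so casting once in `clean_data()` keeps the later steps predictable.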

In [6]:
def clean_data(df):
    '''
    input: df a dataframe
    output: df a dataframe with all the original columns
    '''
    
    # START YOUR CODE HERE ---------
    df = df.withColumn("passenger_count", df["passenger_count"].cast("Integer"))
    df = df.withColumn("total_amount", df["total_amount"].cast("float"))
    df = df.withColumn("tip_amount", df["tip_amount"].cast("float"))
    df = df.withColumn("trip_distance", df["trip_distance"].cast("float"))
    df = df.withColumn("fare_amount", df["fare_amount"].cast("float"))
    df = df.withColumn("tpep_pickup_datetime", df["tpep_pickup_datetime"].cast("timestamp"))
    df = df.withColumn("tpep_dropoff_datetime", df["tpep_dropoff_datetime"].cast("timestamp"))
    
    # END YOUR CODE HERE -----------
    return df

In [7]:
df = clean_data(df)
df.select(['passenger_count', 'total_amount', 'tip_amount', 'trip_distance', 'fare_amount', 'tpep_pickup_datetime', 'tpep_dropoff_datetime']).printSchema()

root
 |-- passenger_count: integer (nullable = true)
 |-- total_amount: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)



## 1b. [4 pts] What are the top 10 pickup-dropoff locations?

Find the top 10 pickup-dropoff location pairs with the highest number of trips (`count`). Order the location pairs by `count` in descending order. If two or more pairs have the same number of trips, break the tie using the trip amount per distance travelled (`trip_rate`), calculated from the columns `total_amount` and `trip_distance`. In certain situations, the pickup and dropoff locations may be the same.

Example output showing expected formatting:

```
+------------+------------+-----+------------------+
|PULocationID|DOLocationID|count|         trip_rate|
+------------+------------+-----+------------------+
|           5|           7|   24| 5.148195749283391|
|           6|           4|   19| 1.420958193039484|
|           3|           2|   15|9.1928382713049282|
|           8|           8|   14|5.1029384838178493|
|           1|           3|    9|7.4403919838271223|
|           9|           2|    9|4.4039182884283829|
|           5|           7|    6|  5.19283827172823|
|           2|           1|    5| 9.233738511638532|
|           1|           9|    3| 8.293827128489212|
|           6|           6|    1| 4.192847382919223|
+------------+------------+-----+------------------+
```
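
The grouping and tie-breaking described above can be sketched in plain Python on a few made-up trips. This is only an illustration under stated assumptions: the toy rows are hypothetical, `trip_rate` is taken as the group's average `total_amount` divided by its average `trip_distance` (equivalent to sum over sum within a group), and ties on `count` are assumed to be broken by the higher `trip_rate` first, since the task text does not state a direction.

```python
from collections import defaultdict

# Hypothetical toy trips: (PULocationID, DOLocationID, total_amount, trip_distance)
trips = [
    (5, 7, 12.0, 2.0),
    (5, 7, 8.0, 2.0),
    (3, 2, 9.0, 1.0),
    (3, 2, 9.0, 2.0),
    (8, 8, 4.0, 1.0),
]

# Per pair: [trip count, summed amount, summed distance].
totals = defaultdict(lambda: [0, 0.0, 0.0])
for pu, do, amount, distance in trips:
    entry = totals[(pu, do)]
    entry[0] += 1
    entry[1] += amount
    entry[2] += distance

# avg(amount) / avg(distance) equals sum(amount) / sum(distance) within a group.
rows = [(pu, do, n, amt / dist) for (pu, do), (n, amt, dist) in totals.items()]

# Sort by count descending, then trip_rate descending (assumed tie-break direction).
rows.sort(key=lambda r: (-r[2], -r[3]))
top = rows[:10]

for row in top:
    print(row)
```

Here pairs (5, 7) and (3, 2) both have two trips, so their order is decided by `trip_rate` alone.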

In [8]:
def common_pair(df):
    '''
    input: df a dataframe
    output: df a dataframe with following columns:
            - PULocationID
            - DOLocationID
            - count
            - trip_rate
            
    trip_rate is the average amount (total_amount) per distance (trip_distance)
    
    '''
    
    # START YOUR CODE HERE ---------
    df = df.groupBy("PULocationID", "DOLocationID").agg(
        pyspark.sql.functions.count("*").alias("count"),
        # Ratio of the group's average total amount to its average distance.
        (pyspark.sql.functions.avg(col("total_amount"))
         / pyspark.sql.functions.avg(col("trip_distance"))).alias("trip_rate"))

    # Most frequent pairs first; ties on count are broken by the higher trip_rate.
    df = df.orderBy(col("count").desc(), col("trip_rate").desc()).limit(10)
    df = df.select("PULocationID", "DOLocationID", "count", "trip_rate")
    # END YOUR CODE HERE -----------
    
    return df

In [9]:
common_pair(df).show()

+------------+------------+-----+------------------+
|PULocationID|DOLocationID|count|         trip_rate|
+------------+------------+-----+------------------+
|         264|         264|   97| 5.482259531398455|
|         239|         238|   34| 8.395489315120459|
|         237|         236|   34|7.1150794250423965|
|         236|         236|   24|12.230708730972086|
|          79|          79|   23|10.641212116864102|
|         142|         239|   23|10.056728351507015|
|         148|          79|   23|  9.72959679025766|
|         263|         141|   23| 7.301437441278104|
|         141|         263|   22| 6.897755674061171|
|         170|         170|   21| 9.681594815392343|
+------------+------------+-----+------------------+



## 1c. [4 pts] When is the trip cheapest (day vs night)?

Divide each day into two parts: Day (from 9am to 8:59:59pm), and Night (from 9pm to 8:59:59am) and find the average total amount per unit distance travelled (use column `total_amount`) for both time periods. Sort the result by `trip_rate` in ascending order to find when the fare rate is cheapest.

Example output showing expected formatting:
```
+---------+-----------------+
|day_night|        trip_rate|
+---------+-----------------+
|      Day|2.391827482920123|
|    Night|4.292818223839121|
+---------+-----------------+
```
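
The Day/Night boundary is easy to get wrong by one hour, so a small plain-Python check of the rule may help. This is a sketch only: `day_or_night` is a hypothetical helper name, and the sample timestamps are made up to probe the edges of each period.

```python
from datetime import datetime

def day_or_night(pickup: datetime) -> str:
    """Label a pickup time: 09:00:00 to 20:59:59 is 'Day', everything else 'Night'."""
    return "Day" if 9 <= pickup.hour < 21 else "Night"

# Hypothetical timestamps sitting right on the period boundaries.
samples = [
    "2019-01-01 08:59:59",
    "2019-01-01 09:00:00",
    "2019-01-01 20:59:59",
    "2019-01-01 21:00:00",
]
labels = [day_or_night(datetime.strptime(s, "%Y-%m-%d %H:%M:%S")) for s in samples]

for s, label in zip(samples, labels):
    print(s, label)
```

Comparing only the hour component is enough here because both periods start and end exactly on the hour.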

In [10]:
def time_of_cheapest_fare(df):
    '''
    input: df a dataframe
    output: df a dataframe with following columns:
            - day_night
            - trip_rate
    
    day_night will have 'Day' or 'Night' based on following conditions:
        - From 9am to 8:59:59pm - Day
        - From 9pm to 8:59:59am - Night
            
    trip_rate is the average amount (total_amount) per distance
    
    '''
    
    # START YOUR CODE HERE ---------
    df = df.withColumn("hour", hour(col("tpep_pickup_datetime")))
    # Hours 9 through 20 cover 9:00:00am to 8:59:59pm.
    df = df.withColumn("day_night",
                       when((col("hour") >= 9) & (col("hour") < 21), "Day").otherwise("Night"))
    df = df.groupBy("day_night").agg(
        (pyspark.sql.functions.avg("total_amount")
         / pyspark.sql.functions.avg("trip_distance")).alias("trip_rate"))
    # Sort ascending so the cheapest period comes first, as the task requires.
    df = df.orderBy(col("trip_rate").asc())
    # END YOUR CODE HERE -----------
    
    return df

In [11]:
time_of_cheapest_fare(df).show()

+---------+-----------------+
|day_night|        trip_rate|
+---------+-----------------+
|    Night|5.537330237759072|
|      Day|6.694883573588345|
+---------+-----------------+



## 1d. [3 pts] Which passenger group size gives the most tips?

Filter the data for trips having fares (`fare_amount`) greater than $2 and the number of passengers (`passenger_count`) greater than 0. Find the average fare and tip (`tip_amount`) for all passenger group sizes and calculate the tip percent (`tip_amount * 100 / fare_amount`). Sort by the tip percent in descending order to get which group size tips most generously.

Example output showing expected formatting:
```
+---------------+------------------+
|passenger_count|       tip_percent|
+---------------+------------------+
|              4|20.129473829283771|
|              2|16.203913838738283|
|              3|14.283814930283822|
|              1|13.393817383918287|
|              6| 12.73928273747182|
|              5|12.402938192848471|
+---------------+------------------+
```
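
The filter and the tip-percent formula can be traced by hand on a few made-up trips. This plain-Python sketch assumes the percent is computed from the group averages (average tip times 100, divided by average fare), which is one reading of the task text; the toy rows and the `mean` helper are hypothetical.

```python
# Hypothetical toy trips: (passenger_count, fare_amount, tip_amount)
trips = [
    (1, 10.0, 2.0),
    (1, 30.0, 2.0),
    (2, 20.0, 5.0),
    (0, 15.0, 3.0),  # dropped: passenger_count is not greater than 0
    (2, 1.5, 1.0),   # dropped: fare_amount is not greater than 2
]

# Keep only trips with fare > $2 and at least one passenger.
kept = [t for t in trips if t[1] > 2 and t[0] > 0]

# Collect fares and tips per passenger group size.
groups = {}
for count, fare, tip in kept:
    fares, tips = groups.setdefault(count, ([], []))
    fares.append(fare)
    tips.append(tip)

def mean(values):
    return sum(values) / len(values)

# Tip percent from group averages: avg(tip) * 100 / avg(fare).
tip_percent = {c: mean(tips) * 100 / mean(fares) for c, (fares, tips) in groups.items()}

# Most generous group first.
ranked = sorted(tip_percent.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [(2, 25.0), (1, 10.0)]
```

Averaging the per-trip percentages instead would give a different figure for group size 1 (about 13.3 rather than 10.0), so it is worth confirming which definition is intended.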

In [12]:
def passenger_count_for_most_tip(df):
    '''
    input: df a dataframe
    output: df a dataframe with following columns:
            - passenger_count
            - tip_percent
            
    tip_percent is the percent of tip out of fare_amount
    
    '''
    
    # START YOUR CODE HERE ---------
    # Keep trips with a fare above $2 and at least one passenger.
    df = df.filter((col("fare_amount") > 2) & (col("passenger_count") > 0))

    df = df.groupBy("passenger_count").agg(
        (pyspark.sql.functions.avg(col("tip_amount")) * 100
         / pyspark.sql.functions.avg(col("fare_amount"))).alias("tip_percent"))
    # Most generous group size first.
    df = df.orderBy(col("tip_percent").desc())
    
    # END YOUR CODE HERE -----------
    
    return df

In [13]:
passenger_count_for_most_tip(df).show()

+---------------+------------------+
|passenger_count|       tip_percent|
+---------------+------------------+
|              2|14.406226490643459|
|              5|14.347176091134632|
|              1|13.816545488299298|
|              4|13.232489331848662|
|              3| 13.11005447222511|
|              6|13.096846913969197|
+---------------+------------------+



## 1e. [3 pts] Which day of week has the most traffic?

Sort the days of the week based on traffic, with the day having the highest traffic at the top. You can estimate traffic on a day of the week from the average speed of all taxi trips on that day. (Speed can be calculated from the trip time and trip distance. Make sure to report speed as distance per hour.) If the `average_speed` is the same for two or more days, the days should be ordered alphabetically. A day with a low average speed indicates high levels of traffic. The average speed may be 0, indicating very high levels of traffic. Not all days of the week may be present. You should use `date_format` along with the appropriate [pattern letters](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) to format the day of the week.

Example output showing expected formatting:
```
+-----------+------------------+
|day_of_week|     average_speed|
+-----------+------------------+
|        Sat|               0.0|
|        Tue|               0.0|
|        Fri|7.2938133827293934|
|        Mon|10.123938472718228|
+-----------+------------------+
```
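
The speed calculation and the two-level sort can be worked through in plain Python. This is a sketch under assumptions: the toy trips are made up, average speed is taken as a day's total distance divided by its total trip time in hours (the same as average distance over average duration), and day names are looked up from a fixed list rather than a locale-dependent formatter.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical toy trips: (pickup, dropoff, trip_distance in miles)
trips = [
    ("2019-01-07 10:00:00", "2019-01-07 10:30:00", 5.0),  # a Monday
    ("2019-01-07 18:00:00", "2019-01-07 18:30:00", 3.0),  # a Monday
    ("2019-01-08 09:00:00", "2019-01-08 09:15:00", 4.0),  # a Tuesday
]

# Per day of week: [summed distance, summed duration in hours].
per_day = {}
for pickup, dropoff, distance in trips:
    start = datetime.strptime(pickup, FMT)
    end = datetime.strptime(dropoff, FMT)
    hours = (end - start).total_seconds() / 3600.0
    entry = per_day.setdefault(DAY_NAMES[start.weekday()], [0.0, 0.0])
    entry[0] += distance
    entry[1] += hours

# Miles per hour for each day; slowest day first, ties ordered alphabetically.
speeds = [(day, dist / hrs) for day, (dist, hrs) in per_day.items()]
speeds.sort(key=lambda r: (r[1], r[0]))
print(speeds)  # [('Mon', 8.0), ('Tue', 16.0)]
```

Monday's two half-hour trips cover 8 miles in one hour, while Tuesday's single trip covers 4 miles in a quarter hour, so Monday ranks as the more congested day.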

In [14]:
def day_with_traffic(df):
    '''
    input: df a dataframe
    output: df a dataframe with following columns:
            - day_of_week
            - average_speed
    
    day_of_week should be the abbreviated day of the week, e.g. Mon, Tue, Wed, ...
    average_speed (miles/hour) is calculated as distance / time (in hours)
    
    '''
    
    # START YOUR CODE HERE ---------
    
    df = df.withColumn("day_of_week", date_format(col("tpep_pickup_datetime"), "E"))
    # Trip duration in hours: casting a timestamp to long yields epoch seconds.
    duration_hours = (col("tpep_dropoff_datetime").cast("long")
                      - col("tpep_pickup_datetime").cast("long")) / 3600.0
    df = df.groupBy("day_of_week").agg(
        (pyspark.sql.functions.avg(col("trip_distance"))
         / pyspark.sql.functions.avg(duration_hours)).alias("average_speed"))
    # Slowest (most congested) day first; ties ordered alphabetically.
    df = df.orderBy(col("average_speed").asc(), col("day_of_week").asc())
    # END YOUR CODE HERE -----------
    
    return df

In [15]:
day_with_traffic(df).show()

+-----------+------------------+
|day_of_week|     average_speed|
+-----------+------------------+
|        Fri|               0.0|
|        Wed|               0.0|
|        Mon|1.3578837962634764|
|        Tue| 9.383778567443406|
+-----------+------------------+

