tunmnlu/task_2/others-answer/DVA-CSE6242-Spring2021-main/hw3-Hadoop,Spark,Pig,AWS,Azure/HW3-mvegesna3/Q1/q1.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HW3 - Q1 [15 pts]\n",
    "\n",
    "\n",
    "\n",
    "## Important Notices\n",
    "\n",
    "<div class=\"alert alert-block alert-danger\">\n",
    "    WARNING: Do <strong>NOT</strong> add any cells to this Jupyter Notebook, because that will crash the autograder.\n",
    "</div>\n",
    "\n",
    "\n",
    "All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there is instructions about completing a task in this notebook, that task is not optional.  \n",
    "\n",
    "\n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    You <strong>must</strong> implement the following functions in this notebook to receive credit.\n",
    "</div>\n",
    "\n",
    "\n",
    "`user()`\n",
    "\n",
    "`clean_data()`\n",
    "\n",
    "`common_pair()`\n",
    "\n",
    "`time_of_cheapest_fare()`\n",
    "\n",
    "`passenger_count_for_most_tip()`\n",
    "\n",
    "`day_with_traffic()`\n",
    "\n",
    "\n",
    "\n",
    "Each method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded. You may assume we will only use your code to work with data from the NYC-TLC dataset during auto-grading. \n",
    "\n",
    "### Helper functions\n",
    "\n",
    "You are permitted to write additional helper functions, or use additional instance variables so long as the previously described functions work as required.\n",
    "\n",
    "<div class=\"alert alert-block alert-danger\">\n",
    "    WARNING: Do <strong>NOT</strong> remove or modify the following utility functions:\n",
    "</div>\n",
    "\n",
    "`load_data()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pyspark Imports\n",
    "<span style=\"color:red\">*Please don't modify the below cell*</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SQLContext\n",
    "from pyspark.sql.functions import hour, when, col, date_format, to_timestamp\n",
    "from pyspark.sql.functions import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Define Spark Context\n",
    "<span style=\"color:red\">*Please don't modify the below cell*</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc = pyspark.SparkContext(appName=\"HW3-Q1\")\n",
    "sqlContext = SQLContext(sc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Function to load data\n",
    "\n",
    "<span style=\"color:red\">*Please don't modify the below cell*</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_data():\n",
    "    df = sqlContext.read.option(\"header\",True).csv(\"yellow_tripdata_2019-01_short.csv\")\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- VendorID: string (nullable = true)\n",
      " |-- tpep_pickup_datetime: string (nullable = true)\n",
      " |-- tpep_dropoff_datetime: string (nullable = true)\n",
      " |-- passenger_count: string (nullable = true)\n",
      " |-- trip_distance: string (nullable = true)\n",
      " |-- RatecodeID: string (nullable = true)\n",
      " |-- store_and_fwd_flag: string (nullable = true)\n",
      " |-- PULocationID: string (nullable = true)\n",
      " |-- DOLocationID: string (nullable = true)\n",
      " |-- payment_type: string (nullable = true)\n",
      " |-- fare_amount: string (nullable = true)\n",
      " |-- extra: string (nullable = true)\n",
      " |-- mta_tax: string (nullable = true)\n",
      " |-- tip_amount: string (nullable = true)\n",
      " |-- tolls_amount: string (nullable = true)\n",
      " |-- improvement_surcharge: string (nullable = true)\n",
      " |-- total_amount: string (nullable = true)\n",
      " |-- congestion_surcharge: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df = load_data()\n",
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Implement the functions below for this assignment:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Update the `user()` function\n",
    "This function should return your GT username, eg: gburdell3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def user():\n",
    "    \"\"\"\n",
    "    :return: string\n",
    "    your GTUsername, NOT your 9-Digit GTId  \n",
    "    \"\"\"  \n",
    "    username = \"mvegesna3\"\n",
    "    return username"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1a. [1 pts] Casting the columns into correct types\n",
    "\n",
    "To process the data accurately, cast the following columns into given data type: \n",
    "\n",
    "- `passenger_count` - integer \n",
    "- `total_amount` - float \n",
    "- `tip_amount` - float\n",
    "- `trip_distance` - float \n",
    "- `fare_amount` - float \n",
    "- `tpep_pickup_datetime` - timestamp \n",
    "- `tpep_dropoff_datetime` - timestamp \n",
    "\n",
    "All of the columns in the original data should be returned with the above columns converted to the correct data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_data(df):\n",
    "    '''\n",
    "    input: df a dataframe\n",
    "    output: df a dataframe with the all the original columns\n",
    "    '''\n",
    "    \n",
    "    # START YOUR CODE HERE ---------\n",
    "    \n",
    "    df = df.withColumn(\"passenger_count\", df.passenger_count.cast(\"integer\"))\n",
    "    df = df.withColumn(\"total_amount\", df.total_amount.cast(\"float\"))\n",
    "    df = df.withColumn(\"tip_amount\", df.tip_amount.cast(\"float\"))\n",
    "    df = df.withColumn(\"trip_distance\", df.trip_distance.cast(\"float\"))\n",
    "    df = df.withColumn(\"fare_amount\", df.fare_amount.cast(\"float\"))\n",
    "    df = df.withColumn(\"tpep_pickup_datetime\", df.tpep_pickup_datetime.cast(\"timestamp\"))\n",
    "    df = df.withColumn(\"tpep_dropoff_datetime\", df.tpep_dropoff_datetime.cast(\"timestamp\"))\n",
    "    \n",
    "    # END YOUR CODE HERE -----------\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- passenger_count: integer (nullable = true)\n",
      " |-- total_amount: float (nullable = true)\n",
      " |-- tip_amount: float (nullable = true)\n",
      " |-- trip_distance: float (nullable = true)\n",
      " |-- fare_amount: float (nullable = true)\n",
      " |-- tpep_pickup_datetime: timestamp (nullable = true)\n",
      " |-- tpep_pickup_datetime: timestamp (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df = clean_data(df)\n",
    "df.select(['passenger_count', 'total_amount', 'tip_amount', 'trip_distance', 'fare_amount', 'tpep_pickup_datetime', 'tpep_pickup_datetime']).printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1b. [4 pts] What are the top 10 pickup-dropoff locations?\n",
    "\n",
    "Find the top 10 pickup-dropoff location pairs having the most number of trips (`count`). The location pairs should be ordered by `count` in descending order. If two or more pairs have the same number of trips, break the tie using the trip amount per distance travelled (`trip_rate`). Use columns `total_amount` and `trip_distance` to calculate the trip amount per distance. In certain situations, the pickup and dropoff locations may be the same.\n",
    "\n",
    "Example output showing expected formatting:\n",
    "\n",
    "```\n",
    "+------------+------------+-----+------------------+\n",
    "|PULocationID|DOLocationID|count|         trip_rate|\n",
    "+------------+------------+-----+------------------+\n",
    "|           5|           7|   24| 5.148195749283391|\n",
    "|           6|           4|   19| 1.420958193039484|\n",
    "|           3|           2|   15|9.1928382713049282|\n",
    "|           8|           8|   14|5.1029384838178493|\n",
    "|           1|           3|    9|7.4403919838271223|\n",
    "|           9|           2|    9|4.4039182884283829|\n",
    "|           5|           7|    6|  5.19283827172823|\n",
    "|           2|           1|    5| 9.233738511638532|\n",
    "|           1|           9|    3| 8.293827128489212|\n",
    "|           6|           6|    1| 4.192847382919223|\n",
    "+------------+------------+-----+------------------+\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def common_pair(df):\n",
    "    '''\n",
    "    input: df a dataframe\n",
    "    output: df a dataframe with following columns:\n",
    "            - PULocationID\n",
    "            - DOLocationID\n",
    "            - count\n",
    "            - trip_rate\n",
    "            \n",
    "    trip_rate is the average amount (total_amount) per distance (trip_distance)\n",
    "    \n",
    "    '''\n",
    "    \n",
    "    # START YOUR CODE HERE ---------\n",
    "    \n",
    "    #getting count for each pickup and drop off location\n",
    "    count_df = df.groupBy(\"PULocationID\", \"DOLocationID\").count()\n",
    "    \n",
    "    #calculating trip rate\n",
    "    avg_total_amount = pyspark.sql.functions.avg('total_amount')\n",
    "    avg_trip_distance = pyspark.sql.functions.avg('trip_distance')\n",
    "    trip_rate_df = df.groupby(\"PULocationID\", \"DOLocationID\").agg(avg_total_amount/avg_trip_distance)\n",
    "    \n",
    "    #combining the dataframes\n",
    "    out = count_df.alias(\"a\").join(trip_rate_df.alias(\"b\"), (count_df.PULocationID == trip_rate_df.PULocationID) & (count_df.DOLocationID == trip_rate_df.DOLocationID))\n",
    "    \n",
    "    #getting the required columns\n",
    "    out = out.withColumn('trip_rate',col(\"(avg(total_amount) / avg(trip_distance))\"))\n",
    "    out = out.drop(col(\"(avg(total_amount) / avg(trip_distance))\"))\n",
    "    out = out.drop(col(\"b.PULocationID\"))\n",
    "    out = out.drop(col(\"b.DOLocationID\"))\n",
    "    out = out.distinct()\n",
    "    \n",
    "    #ordering and limiting to 10\n",
    "    out = out.orderBy(['count','trip_rate'],ascending = False).limit(10)\n",
    "    \n",
    "    df = out\n",
    "    \n",
    "    # END YOUR CODE HERE -----------\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------------+-----+------------------+\n",
      "|PULocationID|DOLocationID|count|         trip_rate|\n",
      "+------------+------------+-----+------------------+\n",
      "|         264|         264|   97| 5.482259531398455|\n",
      "|         239|         238|   34| 8.395489315120459|\n",
      "|         237|         236|   34|7.1150794250423965|\n",
      "|         236|         236|   24|12.230708730972086|\n",
      "|          79|          79|   23|10.641212116864102|\n",
      "|         142|         239|   23|10.056728351507015|\n",
      "|         148|          79|   23|  9.72959679025766|\n",
      "|         263|         141|   23| 7.301437441278104|\n",
      "|         141|         263|   22| 6.897755674061171|\n",
      "|         170|         170|   21| 9.681594815392343|\n",
      "+------------+------------+-----+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "common_pair(df).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1c. [4 pts] When is the trip cheapest (day vs night) ?\n",
    "\n",
    "Divide each day into two parts: Day (from 9am to 8:59:59pm), and Night (from 9pm to 8:59:59am) and find the average total amount per unit distance travelled (use column `total_amount`) for both time periods. Sort the result by `trip_rate` in ascending order to find when the fare rate is cheapest.\n",
    "\n",
    "Example output showing expected formatting:\n",
    "```\n",
    "+---------+-----------------+\n",
    "|day_night|        trip_rate|\n",
    "+---------+-----------------+\n",
    "|      Day|2.391827482920123|\n",
    "|    Night|4.292818223839121|\n",
    "+---------+-----------------+\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def time_of_cheapest_fare(df):\n",
    "    '''\n",
    "    input: df a dataframe\n",
    "    output: df a dataframe with following columns:\n",
    "            - day_night\n",
    "            - trip_rate\n",
    "    \n",
    "    day_night will have 'Day' or 'Night' based on following conditions:\n",
    "        - From 9am to 8:59:59pm - Day\n",
    "        - From 9pm to 8:59:59am - Night\n",
    "            \n",
    "    trip_rate is the average amount (total_amount) per distance\n",
    "    \n",
    "    '''\n",
    "    \n",
    "    # START YOUR CODE HERE ---------\n",
    "    \n",
    "    #time\n",
    "    cheap = df.withColumn('time', date_format(col(\"tpep_pickup_datetime\"),'HH:mm:ss'))\n",
    "    cheap = cheap.withColumn(\"day_night\", when(col('time').between('09:00:00','20:59:59'),\"Day\").otherwise(\"Night\"))\n",
    "\n",
    "    #trip rate\n",
    "    avg_total_amount = pyspark.sql.functions.avg('total_amount')\n",
    "    avg_trip_distance = pyspark.sql.functions.avg('trip_distance')\n",
    "    cheap = cheap.groupBy(\"day_night\").agg(avg_total_amount/avg_trip_distance)\n",
    "    cheap = cheap.withColumn('trip_rate',col(\"(avg(total_amount) / avg(trip_distance))\"))\n",
    "    cheap = cheap.drop(col(\"(avg(total_amount) / avg(trip_distance))\"))\n",
    "    df = cheap.select(\"day_night\",\"trip_rate\")\n",
    "    \n",
    "    # END YOUR CODE HERE -----------\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+-----------------+\n",
      "|day_night|        trip_rate|\n",
      "+---------+-----------------+\n",
      "|    Night|5.537330237759072|\n",
      "|      Day|6.694883573588345|\n",
      "+---------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "time_of_cheapest_fare(df).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1d. [3 pts] Which passenger group size gives the most tips?\n",
    "\n",
    "Filter the data for trips having fares (`fare_amount`) greater than $2 and the number of passengers (`passenger_count`) greater than 0. Find the average fare and tip (`tip_amount`) for all passenger group sizes and calculate the tip percent (`tip_amount * 100 / fare_amount`). Sort by the tip percent in descending order to get which group size tips most generously.\n",
    "\n",
    "Example output showing expected formatting:\n",
    "```\n",
    "+---------------+------------------+\n",
    "|passenger_count|       tip_percent|\n",
    "+---------------+------------------+\n",
    "|              4|20.129473829283771|\n",
    "|              2|16.203913838738283|\n",
    "|              3|14.283814930283822|\n",
    "|              1|13.393817383918287|\n",
    "|              6| 12.73928273747182|\n",
    "|              5|12.402938192848471|\n",
    "+---------------+------------------+\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def passenger_count_for_most_tip(df):\n",
    "    '''\n",
    "    input: df a dataframe\n",
    "    output: df a dataframe with following columns:\n",
    "            - passenger_count\n",
    "            - tip_percent\n",
    "            \n",
    "    trip_percent is the percent of tip out of fare_amount\n",
    "    \n",
    "    '''\n",
    "    \n",
    "    # START YOUR CODE HERE ---------\n",
    "    \n",
    "    df1 = df.select(['fare_amount','passenger_count','tip_amount'])\n",
    "    df1 = df1.filter((col('fare_amount') >2 ) & (col('passenger_count')> 0))\n",
    "\n",
    "    df2 = df1.groupBy(['passenger_count']).agg(pyspark.sql.functions.avg('tip_amount') *100 /pyspark.sql.functions.avg('fare_amount'))\n",
    "    df2 = df2.withColumn('tip_percent',col(\"((avg(tip_amount) * 100) / avg(fare_amount))\")).drop(col(\"((avg(tip_amount) * 100) / avg(fare_amount))\")) \\\n",
    "    .orderBy(['tip_percent'],ascending = False)\n",
    "    \n",
    "    \n",
    "    # END YOUR CODE HERE -----------\n",
    "    \n",
    "    return df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------------+------------------+\n",
      "|passenger_count|       tip_percent|\n",
      "+---------------+------------------+\n",
      "|              2|14.406226490643459|\n",
      "|              5|14.347176091134632|\n",
      "|              1|13.816545488299298|\n",
      "|              4|13.232489331848662|\n",
      "|              3| 13.11005447222511|\n",
      "|              6|13.096846913969197|\n",
      "+---------------+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "passenger_count_for_most_tip(df).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1e. [3 pts] Which day of week has the most traffic?\n",
    "\n",
    "Sort the days of the week based on traffic, with the day having the highest traffic on the top. You can estimate traffic on the day of the week based on the average speed of all taxi trips on that day of the week. (Speed can be calculated by using the trip time and trip distance. Make sure to print speed in distance / hour). If the `average_speed` is the same for two or more days, the days should be ordered alphabetically. A day with a low average speed indicates high levels of traffic. The average speed may be 0 indicating very high levels of traffic. Not all days of the week may be present. You should use `date_format` along with the appropriate [pattern letters](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) to format the day of the week.\n",
    "\n",
    "Example output showing expected formatting:\n",
    "```\n",
    "+-----------+------------------+\n",
    "|day_of_week|     average_speed|\n",
    "+-----------+------------------+\n",
    "|        Sat|               0.0|\n",
    "|        Tue|               0.0|\n",
    "|        Fri|7.2938133827293934|\n",
    "|        Mon|10.123938472718228|\n",
    "+-----------+------------------+\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def day_with_traffic(df):\n",
    "    '''\n",
    "    input: df a dataframe\n",
    "    output: df a dataframe with following columns:\n",
    "            - day_of_week\n",
    "            - average_speed\n",
    "    \n",
    "    day_of_week should be day of week e.g.) Mon, Tue, Wed, ...\n",
    "    average_speed (miles/hour) is calculated as distance / time (in hours)\n",
    "    \n",
    "    '''\n",
    "    \n",
    "    # START YOUR CODE HERE ---------\n",
    "    \n",
    "    #day of the week \n",
    "    df = df.withColumn(\"day_of_week\", date_format(col('tpep_pickup_datetime'), 'E'))\n",
    "    df = df.withColumn(\"dropoff_time\", to_timestamp(date_format(\"tpep_dropoff_datetime\", \"yyyy-MM-dd HH:mm:ss\"))).withColumn(\"dropoff_time\", col(\"dropoff_time\").cast(\"long\"))\n",
    "    df = df.withColumn(\"pickup_time\", to_timestamp(date_format(\"tpep_pickup_datetime\", \"yyyy-MM-dd HH:mm:ss\"))).withColumn(\"pickup_time\", col(\"pickup_time\").cast(\"long\"))\n",
    "    \n",
    "    #average speed\n",
    "    df1 = df.withColumn(\"time\",((df.dropoff_time - df.pickup_time)/3600))\n",
    "    df1 = df1.withColumn(\"speed\",(col(\"trip_distance\")/col(\"time\"))).groupBy(\"day_of_week\").agg({\"speed\":\"avg\"})\n",
    "    df1 = df1.withColumn(\"average_speed\", col(\"avg(speed)\")).drop(\"avg(speed)\")\n",
    "    df1 = df1.orderBy(col(\"average_speed\"),col(\"day_of_week\"))\n",
    "    \n",
    "    # END YOUR CODE HERE -----------\n",
    "    \n",
    "    return df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------+------------------+\n",
      "|day_of_week|     average_speed|\n",
      "+-----------+------------------+\n",
      "|        Fri|               0.0|\n",
      "|        Wed|               0.0|\n",
      "|        Mon|11.788108388798078|\n",
      "|        Tue|12.168344692771637|\n",
      "+-----------+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "day_with_traffic(df).show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}