{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HW3 - Q3 [35 pts]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Important Notices\n", "\n", "
\n", " WARNING: Do NOT add any cells to this Jupyter Notebook, because that will crash the autograder.\n", "
\n", "\n", "
\n", " WARNING: Do NOT import any additional libraries into this notebook.\n", "
\n", "\n", "All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there are instructions about completing a task in this notebook, that task is not optional. \n", "\n", "
\n", " You must implement the following functions in this notebook to receive credit.\n", "
\n", "\n", "`user()`\n", "\n", "`bucket()`\n", "\n", "`long_trips()`\n", "\n", "`manhattan_trips()`\n", "\n", "`weighted_profit()`\n", "\n", "`final_output()`\n", "\n", "Each method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded. You may assume we will only use your code to work with data from the NYC-TLC dataset during auto-grading.\n", "\n", "
\n", " WARNING: Do NOT remove or modify the following utility functions:\n", "
\n", "\n", "`load_data()`\n", "\n", "`main()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Do not change the cell below. Run it to initialize your PySpark instance.\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9f463b49791a43bb9e66042db165dff7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "\n", "
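<table><tr><th>ID</th><th>YARN Application ID</th><th>Kind</th><th>State</th><th>Spark UI</th><th>Driver log</th><th>Current session?</th></tr><tr><td>4</td><td>application_1603466875534_0005</td><td>pyspark</td><td>idle</td><td>Link</td><td>Link</td><td></td></tr></table>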
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "SparkSession available as 'spark'.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "" ] } ], "source": [ "sc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " WARNING: Do NOT modify the cell below. It contains all imports, the function for loading data, and the function for running your code.\n", "
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ad3f84578305458298bfb359ada4a095", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#### DO NOT CHANGE ANYTHING IN THIS CELL ####\n", "\n", "from pyspark.sql.functions import col\n", "\n", "def load_data(size='small'):\n", " # Loads the data for this question. Do not change this function.\n", " # This function should only be called with the parameter 'small' or 'large'\n", " \n", " if size != 'small' and size != 'large':\n", " print(\"Invalid size parameter provided. 
Use only 'small' or 'large'.\")\n", " return\n", " \n", " input_bucket = \"s3://fall2020-cse6242\"\n", " \n", " # Load Trip Data\n", " trip_path = '/'+size+'/yellow_tripdata*'\n", " trips = spark.read.csv(input_bucket + trip_path, header=True, inferSchema=True)\n", " print(\"Trip Count: \",trips.count()) # Prints # of trips (# of records, as each record is one trip)\n", " \n", " # Load Lookup Data\n", " lookup_path = '/'+size+'/taxi*'\n", " lookup = spark.read.csv(input_bucket + lookup_path, header=True, inferSchema=True)\n", " \n", " return trips, lookup\n", "\n", "def main(size, bucket):\n", " # Runs your functions implemented above.\n", " \n", " print(user())\n", " trips, lookup = load_data(size=size)\n", " trips = long_trips(trips)\n", " mtrips = manhattan_trips(trips, lookup)\n", " wp = weighted_profit(trips, mtrips)\n", " final = final_output(wp,lookup)\n", " \n", " # Outputs the results for you to visually see\n", " final.show()\n", " \n", " # Writes out as a CSV to your bucket.\n", " final.write.csv(bucket)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "460b71b5e74b44af8549a244b2b570dd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- VendorID: string (nullable = true)\n", " |-- tpep_pickup_datetime: string (nullable = true)\n", " |-- tpep_dropoff_datetime: string (nullable = true)\n", " |-- passenger_count: string (nullable = true)\n", " |-- trip_distance: string (nullable = true)\n", " |-- RatecodeID: string 
(nullable = true)\n", " |-- store_and_fwd_flag: string (nullable = true)\n", " |-- PULocationID: string (nullable = true)\n", " |-- DOLocationID: string (nullable = true)\n", " |-- payment_type: string (nullable = true)\n", " |-- fare_amount: string (nullable = true)\n", " |-- extra: string (nullable = true)\n", " |-- mta_tax: string (nullable = true)\n", " |-- tip_amount: string (nullable = true)\n", " |-- tolls_amount: string (nullable = true)\n", " |-- improvement_surcharge: string (nullable = true)\n", " |-- total_amount: string (nullable = true)" ] } ], "source": [ "trips.printSchema()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "67668ed1d91f480cbf7bd5f1e68cf5a5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- DOLocationID: string (nullable = true)\n", " |-- pcount: double (nullable = true)" ] } ], "source": [ "mtrips.printSchema()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f948485e92864bbeb4c8a7337c135500", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": 
"display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- PULocationID: string (nullable = true)\n", " |-- weighted_profit: double (nullable = true)" ] } ], "source": [ "wp.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Implement the below functions for this assignment:\n", "
\n", " WARNING: Do NOT change any function inputs or outputs, and ensure that the DataFrames your code returns match the schema definitions given in the comments of each function.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3a. [1 pt] Update the `user()` function\n", "This function should return your GT username, e.g., gburdell3." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4afdd69737d84b29837af8932b6658dc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def user():\n", " # Returns a string consisting of your GT username.\n", " return 'helfayoumy3'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3b. [2 pts] Update the `long_trips()` function\n", "This function filters trips to keep only trips longer than 2 miles." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2862d0db6a5d458583715ab6820ec720", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def long_trips(trips):\n", " # Keeps trips whose trip_distance is at least 2 miles (2.0 itself is included).\n", " trips = trips.where(col(\"trip_distance\") >= 2.0)\n", " return trips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3c. 
[6 pts] Update the `manhattan_trips()` function\n", "\n", "This function determines the top 20 drop-off locations (`DOLocationID`) in Manhattan by total `passenger_count` (`pcount`).\n", "\n", "Example output formatting:\n", "\n", "```\n", "+--------------+--------+\n", "| DOLocationID | pcount |\n", "+--------------+--------+\n", "|             5|      15|\n", "|            16|      12|\n", "+--------------+--------+\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "51b67409bc8d40199ff7e2481133027a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def manhattan_trips(trips, lookup):\n", " # Returns a Dataframe with Schema: DOLocationID, pcount\n", " # Join trips to the lookup table on the drop-off location and keep Manhattan drop-offs only,\n", " # then sum passengers per drop-off location and keep the 20 largest totals.\n", " mtrips = trips.join(lookup, col(\"DOLocationID\") == col(\"LocationID\")).where(col(\"Borough\") == \"Manhattan\")\\\n", " .groupby(col(\"DOLocationID\")).agg({'passenger_count':'sum'})\\\n", " .withColumnRenamed(\"sum(passenger_count)\", \"pcount\")\\\n", " .orderBy(col(\"pcount\").desc()).limit(20)\n", " return mtrips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3d. 
[6 pts] Update the `weighted_profit()` function\n", "This function should determine the average `total_amount`, the total count of trips, and the total count of trips ending in the top 20 destinations and return the `weighted_profit` as discussed in the homework document.\n", "\n", "Example output formatting:\n", "```\n", "+--------------+-------------------+\n", "| PULocationID | weighted_profit |\n", "+--------------+-------------------+\n", "| 18| 33.784444421924436| \n", "| 12| 21.124577637149223| \n", "+--------------+-------------------+\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bc16d9d5856e48baa32c37262ffbab04", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def weighted_profit(trips, mtrips): \n", " # Returns a Dataframe with Schema: PULocationID, weighted_profit\n", " wp1 = trips.groupBy(col(\"PULocationID\")).agg({\"total_amount\":\"mean\", \"PULocationID\":\"count\"})\n", " wp2 = trips.join(mtrips, [\"DOLocationID\"])\n", " wp2 = wp2.groupBy(col(\"PULocationID\")).agg({\"DOLocationID\" : \"count\"})\n", " wp3 = wp1.join(wp2, \"PULocationID\")\n", " wp = wp3.withColumn(\"weighted_profit\", col(\"avg(total_amount)\") * (col(\"count(DOLocationID)\") / col(\"count(PULocationID)\")))\\\n", " .select(\"PULocationID\", \"weighted_profit\").orderBy(col(\"weighted_profit\").desc()).limit(20)\n", " return wp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3e. 
[5 pts] Update the `final_output()` function\n", "This function takes the results of `weighted_profit`, links them to the `Borough` and `Zone`, and returns the top 20 locations with the highest `weighted_profit`.\n", "\n", "Example output formatting:\n", "```\n", "+------------+---------+-------------------+\n", "| Zone       | Borough | weighted_profit   |\n", "+------------+---------+-------------------+\n", "| JFK Airport|   Queens|  16.95897820117925|\n", "|     Jamaica|   Queens| 14.879835188762488|\n", "+------------+---------+-------------------+\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2f923c99ffa94f8ba80c5af425fb74e4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def final_output(wp, lookup): \n", " # Returns a Dataframe with Schema: Zone, Borough, weighted_profit\n", " # Attach Zone and Borough to each pickup location, then keep the 20 highest weighted profits.\n", " final = lookup.join(wp, lookup[\"LocationID\"] == wp[\"PULocationID\"])\\\n", " .select(\"Zone\", \"Borough\", \"weighted_profit\")\\\n", " .orderBy(col(\"weighted_profit\").desc()).limit(20)\n", "\n", " return final" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Test your code on the small dataset first, as the large dataset takes significantly longer to run.\n", "
\n", "\n", "
\n", " WARNING: Do NOT use the same bucket URL for multiple runs of the `main()` function, as this will cause errors. Make sure to change the name of your output location every time (e.g., s3://cse6242-gburdell3/output-small2).\n", "
\n", "\n", "Update the cell below with the address of your bucket, then run it to execute your code and store the results in S3.\n", "\n", "When you have confirmed the results of the small dataset, run it again using the large dataset. Your output file will appear in a folder in your S3 bucket called YOUROUTPUT.csv, as a CSV file with a name something like part-0000-4d992f7a-0ad3-48f8-8c72-0022984e4b50-c000.csv. Download this file and rename it to q3_output.csv for submission. Do not make any other changes to the file. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ea204ba390134a5dba7984e10016c16e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "helfayoumy3\n", "Trip Count: 187203269\n", "+--------------------+-------------+------------------+\n", "| Zone| Borough| weighted_profit|\n", "+--------------------+-------------+------------------+\n", "| Baisley Park| Queens|29.360455779130838|\n", "|Flushing Meadows-...| Queens| 27.30484573361766|\n", "| South Jamaica| Queens|26.294916239873476|\n", "| Randalls Island| Manhattan| 24.15098994022752|\n", "| Astoria Park| Queens| 21.70641711214752|\n", "|Briarwood/Jamaica...| Queens|19.945064631789332|\n", "|Springfield Garde...| Queens|19.468309288781903|\n", "| Jamaica| Queens|19.283943000137885|\n", "| Corona| Queens|18.228769248155974|\n", "| LaGuardia Airport| Queens|18.181338808373006|\n", "| Jamaica Bay| Queens|17.100529446757896|\n", "| Maspeth| Queens|17.005450640079548|\n", 
"|Eltingville/Annad...|Staten Island| 16.83776475694444|\n", "| JFK Airport| Queens|16.777725348249643|\n", "| Battery Park| Manhattan| 12.84978031114287|\n", "| Morningside Heights| Manhattan| 12.45369802658408|\n", "| Battery Park City| Manhattan|12.448848404428603|\n", "|Greenwich Village...| Manhattan|12.446949891694034|\n", "| Rikers Island| Bronx| 12.3063|\n", "| World Trade Center| Manhattan|12.295411924133381|\n", "+--------------------+-------------+------------------+" ] } ], "source": [ "bucket = 's3://cse6242-helfayoumy3/output-small7'\n", "#main('small',bucket)\n", "main('large', bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Testing\n", "\n", "
\n", " You may use the cell below for any additional testing you need to do; however, any code written there will not be run or used when grading.\n", "
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2e671496aba64f25af7fd52278badd6f", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox()" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "helfayoumy3\n", "Trip Count: 187203269\n", "+--------------------+-------------+------------------+\n", "| Zone| Borough| weighted_profit|\n", "+--------------------+-------------+------------------+\n", "| Baisley Park| Queens| 29.36045577913084|\n", "|Flushing Meadows-...| Queens|27.304845733617682|\n", "| South Jamaica| Queens|26.294916239873473|\n", "| Randalls Island| Manhattan| 24.15098994022752|\n", "| Astoria Park| Queens|21.706417112147527|\n", "|Briarwood/Jamaica...| Queens| 19.94506463178933|\n", "|Springfield Garde...| Queens|19.468309288781906|\n", "| Jamaica| Queens|19.283943000137903|\n", "| Corona| Queens|18.228769248155974|\n", "| LaGuardia Airport| Queens|18.181338808373003|\n", "| Jamaica Bay| Queens|17.100529446757896|\n", "| Maspeth| Queens|17.005450640079545|\n", "|Eltingville/Annad...|Staten Island| 16.83776475694445|\n", "| JFK Airport| Queens|16.777725348249636|\n", "| Battery Park| Manhattan|12.849780311142872|\n", "| Morningside Heights| Manhattan|12.453698026584075|\n", "| Battery Park City| Manhattan|12.448848404428599|\n", "|Greenwich Village...| Manhattan|12.446949891694036|\n", "| Rikers Island| Bronx| 12.3063|\n", "| World Trade Center| Manhattan|12.295411924133381|\n", "+--------------------+-------------+------------------+" ] } ], "source": [ "print(user())\n", "trips, lookup = 
load_data(size='large')\n", "trips = long_trips(trips)\n", "mtrips = manhattan_trips(trips, lookup)\n", "wp = weighted_profit(trips, mtrips)\n", "final = final_output(wp,lookup)\n", "\n", "# Outputs the results for you to visually see\n", "final.show()\n", "\n", "#trips.show(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "PySpark", "language": "", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 2 }, "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python2" } }, "nbformat": 4, "nbformat_minor": 4 }