Files
004_comission/tunmnlu/task_3/Skeleton/Q4/q4_original.ipynb
louiscklaw ae3970ff3c update,
2025-01-31 22:21:55 +08:00

497 lines
14 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HW3 Q4 [10 pts]\n",
"\n",
"\n",
"\n",
"## Important Notices\n",
"\n",
"<div class=\"alert alert-block alert-danger\">\n",
" WARNING: Do <strong>NOT</strong> add any cells to this Jupyter Notebook, because that will crash the autograder. Additionally, Make sure to delete or comment out any code you add to this notebook for testing purposes prior to submitting the notebook to Gradescope. Failure to do so may crash the autograder. \n",
"</div>\n",
"\n",
"\n",
"All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there is instructions about completing a task in this notebook, that task is not optional. \n",
"\n",
"\n",
"\n",
"<div class=\"alert alert-block alert-info\">\n",
" You <strong>must</strong> implement the following functions in this notebook to receive credit.\n",
"</div>\n",
"\n",
"`user()`\n",
"\n",
"`load_data()`\n",
"\n",
"`exclude_no_pickuplocations()`\n",
"\n",
"`exclude_no_tripdistance()`\n",
"\n",
"`include_fare_range()`\n",
"\n",
"`get_highest_tip()`\n",
"\n",
"`get_total_toll()`\n",
"\n",
"Each method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded. You may assume we will only use your code to work with data from NYC Taxi Trips during auto-grading. You do not need to write code for unreasonable scenarios. \n",
"\n",
"Since the overall correctness of your code will require multiple function to work together correctly (i.e., all methods are interdepedent), implementing only a subset of the functions likely will lead to a low score.\n",
"\n",
"### Helper functions\n",
"\n",
"You should not use - and do not need - any helper functions; implement the required functions by completing the function skeletons below. All of the fuctions you implement should be self-contained; each function should carry all of the code it needs to execute on its expected input(s) within its body. \n",
"\n",
"### Kernel Selection\n",
"\n",
"Be sure that you select the PySpark kernel from the kernel options above, if it is not already selected. Do not use the Python3 kernel. The notebook should default to the PySpark kernel. \n",
"\n",
"### GCP and Gradescope\n",
"For submitting to Gradescope, you should use spark dataframes and dataframe operations _only_. \n",
"You should not use SQL. \n",
"\n",
"### SparkSession Objects and SparkContexts\n",
"Note that you have access to a SparkSession object in the notebook environment in virtue of using the PySpark kernel. It can be referenced using the variable name _spark_ and does not need to be explicitly created or defined. This beavior can be leveraged to read in the initial dataset. \n",
"\n",
"You should not need to invoke _spark_ in any of your functions but the load_data() function. Do not hardcode references to _spark_ into any of your functions except for the load_data() function. \n",
"\n",
"### Assumptions Regarding Function Inputs\n",
"This question assumes that you're carrying the results of your prior work foward at each stage of data preparation and analysis. Thus, the _input_ to each of the _filtering_ functions below should be the _output_ of the _prior_ function. The inputs to the two analytical functions should be the output of the last filtering function. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Pyspark Imports\n",
"<span style=\"color:red\">*Please don't modify the below cell*</span>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"import pyspark"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"### Student Section - Please compete all the functions below"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to return GT Username"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def user():\n",
" \"\"\"\n",
" :return: string\n",
" your GTUsername, NOT your 9-Digit GTId \n",
" \"\"\" \n",
" return 'tlou31'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to load data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def load_data(gcp_storage_path):\n",
" \"\"\"\n",
" :param gcp_storage_path: string (full gs path including file name e.g gs://bucket_name/data.csv) \n",
" :return: spark dataframe \n",
" \"\"\"\n",
" ################################################################\n",
" # code to load yellow_tripdata_2019-01.csv data from your GCP #\n",
" # storage bucket # \n",
" ################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to exclude trips that don't have a pickup location"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def exclude_no_pickuplocations(df):\n",
" \"\"\"\n",
" :param nyc tax trips dataframe: spark dataframe \n",
" :return: spark dataframe \n",
" \"\"\"\n",
" ################################################################\n",
" # code to exclude trips with no pickup locations #\n",
" # Note: Exclude nulls and zeros # \n",
" ################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to exclude trips with no distance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def exclude_no_tripdistance(df):\n",
" \"\"\"\n",
" :param nyc tax trips dataframe: spark dataframe \n",
" :return: spark dataframe \n",
" \"\"\"\n",
" ################################################################\n",
" # code to exclude trips with no trip distances #\n",
" # Note: Exclude nulls and zeros # \n",
" ################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to include fare amount between the range of 20 to 60 Dollars"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def include_fare_range(df):\n",
" \n",
" \"\"\"\n",
" :param nyc tax trips dataframe: spark dataframe \n",
" :return: spark dataframe \n",
" \"\"\"\n",
" ################################################################\n",
" # code to include trips with only within the fare range of #\n",
" # 20 to 60 dollars (including 20 and 60 dollars) # \n",
" ################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to get the highest tip amount"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def get_highest_tip(df):\n",
" \"\"\"\n",
" :param nyc tax trips dataframe: spark dataframe \n",
" :return: decimal (rounded to 2 digits) (NOTE: DON'T USE FLOAT)\n",
" \"\"\"\n",
" \n",
" ################################################################\n",
" # code to get the highest tip amount #\n",
" # # \n",
" ################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Function to get total toll amount"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#export\n",
"def get_total_toll(df):\n",
" \"\"\"\n",
" :param nyc tax trips dataframe: spark dataframe \n",
" :return: decimal (rounded to 2 digits) (NOTE: DON'T USE FLOAT)\n",
" \"\"\"\n",
" \n",
" ################################################################\n",
" # code to get total toll amount #\n",
" # # \n",
" ################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run above functions and print\n",
"\n",
"#### Uncomment the cells below and test your implemented functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load data from yellow_tripdata09-08-2021.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#gcp_storage_path = \"gs://<replace_with_your_storage_bucket>/yellow_tripdata09-08-2021.csv\"\n",
"#df = load_data(gcp_storage_path)\n",
"#df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print total numbers of rows in the dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#df.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print total number of rows in the dataframe after excluding trips with no pickup location"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#df_no_pickup_locations = exclude_no_pickuplocations(df)\n",
"#df_no_pickup_locations.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print total number of rows in the dataframe after exclude trips with no distance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#df_no_trip_distance = exclude_no_tripdistance(df_no_pickup_locations)\n",
"#df_no_trip_distance.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print total number of rows in the dataframe after including trips with fair amount between the range of 20 to 60 Dollars"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#df_include_fare_range = include_fare_range(df_no_trip_distance)\n",
"#df_include_fare_range.count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print the highest tip amount"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#max_tip = get_highest_tip(df_include_fare_range)\n",
"#print(max_tip)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Print the total toll amount"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"#total_toll = get_total_toll(df_include_fare_range)\n",
"#print(total_toll)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}