tunmnlu/task_2/others-answer/Data-and-Visual-Analytics-529570f9439b68570d11b99bfdfb20b9febcfa62/HW3-helfayoumy3/Q4/q4_pyspark-gcp.ipynb


			
				
					
						
						
						
							
							
							{"nbformat_minor": 2, "cells": [{"source": "# HW3 Q4 [10 pts]\n\n\n\n## Important Notices\n\n<div class=\"alert alert-block alert-danger\">\n    WARNING: Do <strong>NOT</strong> add any cells to this Jupyter Notebook, because that will crash the autograder.\n</div>\n\n\nAll instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there is instructions about completing a task in this notebook, that task is not optional.  \n\n\n\n<div class=\"alert alert-block alert-info\">\n    You <strong>must</strong> implement the following functions in this notebook to receive credit.\n</div>\n\n`user()`\n\n`load_data()`\n\n`exclude_no_pickuplocations()`\n\n`exclude_no_tripdistance()`\n\n`include_fare_range()`\n\n`get_highest_tip()`\n\n`get_total_toll()`\n\nEach method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded.  You may assume we will only use your code to work with data from NYC Taxi Trips during auto-grading. You do not need to write code for unreasonable scenarios.  \n\nSince the overall correctness of your code will require multiple function to work together correctly (i.e., all methods are interdepedent), implementing only a subset of the functions likely will lead to a low score.\n\n### Helper functions\n\nYou are permitted to write additional helper functions, or use additional instance variables so long as the previously described functions work as required.  ", "cell_type": "markdown", "metadata": {}}, {"source": "#### Pyspark Imports\n<span style=\"color:red\">*Please don't modify the below cell*</span>", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import pyspark\nfrom pyspark.sql import SQLContext", "outputs": [], "metadata": {}}, {"source": "#### Define Spark Context\n<span style=\"color:red\">*Please don't modify the below cell*</span>", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "sc\nsqlContext = SQLContext(sc)", "outputs": [], "metadata": {}}, {"source": "### Student Section - Please compete all the functions below", "cell_type": "markdown", "metadata": {}}, {"source": "#### Function to return GT Username", "cell_type": "markdown", "metadata": {}}, {"execution_count": 85, "cell_type": "code", "source": "def user():\n        \"\"\"\n        :return: string\n        your GTUsername, NOT your 9-Digit GTId  \n        \"\"\"         \n        return 'helfayoumy3'", "outputs": [], "metadata": {}}, {"source": "#### Function to load data", "cell_type": "markdown", "metadata": {}}, {"execution_count": 86, "cell_type": "code", "source": "def load_data(gcp_storage_path):\n    \"\"\"\n        :param gcp_storage_path: string (full gs path including file name e.g gs://bucket_name/data.csv) \n        :return: spark dataframe  \n    \"\"\"\n    df = sqlContext.read.option(\"header\",True).csv(\"gs://helfayoumy3/yellow_tripdata_2019-01.csv\")\n    \n    ################################################################\n    # code to load yellow_tripdata_2019-01.csv data from your GCP storage bucket#\n    #                                                              #        \n    ################################################################\n        \n    return df", "outputs": [], "metadata": {}}, {"source": "#### Function to exclude trips that don't have a pickup location", "cell_type": "markdown", "metadata": {}}, {"execution_count": 89, "cell_type": "code", "source": "def exclude_no_pickuplocations(df):\n    \"\"\"\n        :param nyc tax trips dataframe: spark dataframe \n        :return: spark dataframe  \n    \"\"\"\n    from pyspark.sql.functions import col\n\n    df = df.withColumn(\"PULocationID\", col(\"PULocationID\").cast('string'))\n    df = df.where(df.PULocationID.isNotNull())\n    \n    ################################################################\n    # code to exclude trips with no pickup locations               #\n    #                                                              #        \n    ################################################################\n    \n    return df", "outputs": [], "metadata": {}}, {"source": "#### Function to exclude trips with no distance", "cell_type": "markdown", "metadata": {}}, {"execution_count": 93, "cell_type": "code", "source": "def exclude_no_tripdistance(df):\n    \"\"\"\n        :param nyc tax trips dataframe: spark dataframe \n        :return: spark dataframe  \n    \"\"\"\n    from pyspark.sql.functions import col\n\n    df = df.withColumn(\"trip_distance\", col(\"trip_distance\").cast('float'))\n    df = df.where(df.trip_distance.isNotNull())\n    df = df.where(df.trip_distance != '0.0')\n    \n    ################################################################\n    # code to exclude trips with no trip distances                 #\n    #                                                              #        \n    ################################################################\n    \n    return df", "outputs": [], "metadata": {}}, {"source": "#### Function to include fare amount between the range of 20 to 60 Dollars", "cell_type": "markdown", "metadata": {}}, {"execution_count": 99, "cell_type": "code", "source": "def include_fare_range(df):\n    \"\"\"\n        :param nyc tax trips dataframe: spark dataframe \n        :return: spark dataframe  \n    \"\"\"\n    from pyspark.sql.functions import col\n    #df = df.withColumn(\"fare_amount\", col(\"fare_amount\").cast('float'))\n    df = df.filter(col(\"fare_amount\").between(20,60))\n\n    \n    ################################################################\n    # code to include trips with only within the fare range of     #\n    # 20 to 60 dollars                                             #        \n    ################################################################\n    \n    return df", "outputs": [], "metadata": {}}, {"source": "#### Function to get the highest tip amount", "cell_type": "markdown", "metadata": {}}, {"execution_count": 122, "cell_type": "code", "source": "def get_highest_tip(df):\n    \"\"\"\n        :param nyc tax trips dataframe: spark dataframe \n        :return: float (rounded to 2 digits)  \n    \"\"\"\n    import pyspark.sql.functions as F\n    from pyspark.sql.functions import col\n\n    df = df.withColumn(\"tip_amount\", col(\"tip_amount\").cast('float'))\n    max_tip = round(df.agg(F.max(F.abs(df.tip_amount))).first()[0],2)\n    \n    #max_tip = df.orderBy(\"tip_amount\", ascending=False).collect()[0][13]\n\n    ################################################################\n    # code to get the highest tip                                  #\n    #                                                              #        \n    ################################################################\n    \n    return max_tip", "outputs": [], "metadata": {}}, {"source": "#### Function to get total toll amount", "cell_type": "markdown", "metadata": {}}, {"execution_count": 135, "cell_type": "code", "source": "def get_total_toll(df):\n    \"\"\"\n        :param nyc tax trips dataframe: spark dataframe \n        :return: float (rounded to 2 digits)  \n    \"\"\"\n    from pyspark.sql import functions as F\n    from pyspark.sql.functions import col\n\n    df = df.withColumn(\"tolls_amount\", col(\"tolls_amount\").cast('float'))\n    total_toll = round(df.groupBy().agg(F.sum(\"tolls_amount\")).collect()[0][0],2)\n    \n    ################################################################\n    # code to get total toll amount                                #\n    #                                                              #        \n    ################################################################\n    \n    return total_toll", "outputs": [], "metadata": {}}, {"source": "### Run above functions and print\n\n#### Uncomment the cells below and test your implemented functions", "cell_type": "markdown", "metadata": {}}, {"source": "#### Load data from yellow_tripdata_2019-01.csv ", "cell_type": "markdown", "metadata": {}}, {"execution_count": 44, "cell_type": "code", "source": "gcp_storage_path = \"gs://helfayoumy3/yellow_tripdata_2019-01.csv\"\ndf = load_data(gcp_storage_path)\ndf.printSchema()", "outputs": [{"output_type": "stream", "name": "stdout", "text": "root\n |-- VendorID: string (nullable = true)\n |-- tpep_pickup_datetime: string (nullable = true)\n |-- tpep_dropoff_datetime: string (nullable = true)\n |-- passenger_count: string (nullable = true)\n |-- trip_distance: string (nullable = true)\n |-- RatecodeID: string (nullable = true)\n |-- store_and_fwd_flag: string (nullable = true)\n |-- PULocationID: string (nullable = true)\n |-- DOLocationID: string (nullable = true)\n |-- payment_type: string (nullable = true)\n |-- fare_amount: string (nullable = true)\n |-- extra: string (nullable = true)\n |-- mta_tax: string (nullable = true)\n |-- tip_amount: string (nullable = true)\n |-- tolls_amount: string (nullable = true)\n |-- improvement_surcharge: string (nullable = true)\n |-- total_amount: string (nullable = true)\n |-- congestion_surcharge: string (nullable = true)\n\n"}], "metadata": {}}, {"source": "#### Print total numbers of rows in the dataframe", "cell_type": "markdown", "metadata": {}}, {"execution_count": 45, "cell_type": "code", "source": "df.count()", "outputs": [{"execution_count": 45, "output_type": "execute_result", "data": {"text/plain": "7667792"}, "metadata": {}}], "metadata": {}}, {"source": "#### Print total number of rows in the dataframe after excluding trips with no pickup location", "cell_type": "markdown", "metadata": {}}, {"execution_count": 91, "cell_type": "code", "source": "df_no_pickup_locations = exclude_no_pickuplocations(df)\ndf_no_pickup_locations.count()", "outputs": [{"execution_count": 91, "output_type": "execute_result", "data": {"text/plain": "7667792"}, "metadata": {}}], "metadata": {}}, {"source": "#### Print total number of rows in the dataframe after exclude trips with no distance", "cell_type": "markdown", "metadata": {}}, {"execution_count": 94, "cell_type": "code", "source": "df_no_trip_distance = exclude_no_tripdistance(df_no_pickup_locations)\ndf_no_trip_distance.count()", "outputs": [{"execution_count": 94, "output_type": "execute_result", "data": {"text/plain": "7613022"}, "metadata": {}}], "metadata": {}}, {"source": "#### Print total number of rows in the dataframe after including trips with fair amount between the range of 20 to 60 Dollars", "cell_type": "markdown", "metadata": {}}, {"execution_count": 100, "cell_type": "code", "source": "df_include_fare_range = include_fare_range(df_no_trip_distance)\ndf_include_fare_range.count()", "outputs": [{"execution_count": 100, "output_type": "execute_result", "data": {"text/plain": "957922"}, "metadata": {}}], "metadata": {}}, {"source": "#### Print the highest tip amount", "cell_type": "markdown", "metadata": {}}, {"execution_count": 123, "cell_type": "code", "source": "max_tip = get_highest_tip(df_include_fare_range)\nprint(max_tip)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "444.8\n"}], "metadata": {}}, {"source": "#### Print the total toll amount", "cell_type": "markdown", "metadata": {}}, {"execution_count": 136, "cell_type": "code", "source": "total_toll = get_total_toll(df_include_fare_range)\nprint(total_toll)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "2096233.31\n"}], "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "", "outputs": [], "metadata": {}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "PySpark", "name": "pyspark", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "version": "2.7.14", "name": "python", "file_extension": ".py", "pygments_lexer": "ipython2", "codemirror_mode": {"version": 2, "name": "ipython"}}}}
						
						
					
				
				
					
						Reference in New Issue
					
					View Git Blame
					Copy Permalink