{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "cr_lUENJdGW1" }, "source": [ "\n", "\n", "\n", "[Pandas](http://pandas.pydata.org/) is a Python library that provides data structures and functions for fast, easy, and expressive manipulation of *structured data*.\n", "\n", "\n", "\n", "\n", "- It provides two main data structures: the `Series` which holds a **1-dimensional sequence** of ***homogeneous*** values, and the `DataFrame`, which holds a ***tabular***, ***heterogeneous*** dataset.\n", "\n", "- It also contains a large number of functions and methods to manipulate and summarize `Series` and `DataFrame` objects.\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "yU_LgG6WdGW2" }, "outputs": [], "source": [ "import pandas as pd # abbreviated as pd conventionally" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "Uv3PBPdW8mU6", "outputId": "bb76b304-b0df-418c-d9e1-ca819db08816" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'2.0.3'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 2 } ], "source": [ "pd.__version__" ] }, { "cell_type": "markdown", "metadata": { "id": "1pUM0xM48RWJ" }, "source": [ "# 1 `Series` and `DataFrame`\n", "\n", "A [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) represents a ***2-dimensional***, ***tabular*** data structure containing an ***ordered*** collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.).\n", "\n", "Previously in web scraping, we have seen:\n", "```python\n", "lie_df = pd.DataFrame({'date': date_list, 'lie': lie_list, 'explanation': explanation_list, 'url': url_list})\n", "```\n", "\n", "\n", "A `DataFrame` can be thought of as a specialization of a Python dictionary. 
It maps names (i.e., column names or indices) to a sequence of data series that share the same set of labels (i.e., row names or indices).\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Let's build up a `DataFrame` from scratch based on this component view (column by column):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Tu_Te5Ga8HXb", "outputId": "fc65c8bb-a16d-4d87-f6f7-b27323cbfe42" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 Wan Chai\n", "1 North\n", "2 Sai Kung\n", "3 Sha Tin\n", "dtype: object" ] }, "metadata": {}, "execution_count": 3 } ], "source": [ "# https://pandas.pydata.org/docs/reference/api/pandas.Series.html\n", "\n", "# a Series can be thought of as a 1-dimensional array with attached labels\n", "# a default index, consisting of the integers 0 through n-1, is attached automatically\n", "district_name = pd.Series(['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])\n", "district_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "afo4Oh1ziCzs", "outputId": "1dd3b6c3-0a3f-4e1f-a08f-2313582bdbe8" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array(['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'], dtype=object)" ] }, "metadata": {}, "execution_count": 4 } ], "source": [ "# return the values of the Series as a NumPy array\n", "district_name.values" ] }, { "cell_type": "markdown", "metadata": { "id": "B8OMPSkv2Nu6" }, "source": [ "A NumPy `array` is similar to a Python `list`, but all of its elements share a single data type (`dtype`). This homogeneity is beneficial for certain operations, especially those that are mathematically intensive, as it allows for more efficient data processing." 
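, "\n",
"\n",
"A minimal sketch of the difference, assuming NumPy is available (the names `values_list` and `values_array` are illustrative and are not used elsewhere in this notebook):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"values_list = [9.83, 136.61, 129.65, 68.71]\n",
"values_array = np.array(values_list)\n",
"\n",
"# every element of the array shares one dtype\n",
"print(values_array.dtype)\n",
"\n",
"# multiplying a list repeats it, giving 8 elements\n",
"doubled_list = values_list * 2\n",
"\n",
"# multiplying an array doubles each of its 4 elements\n",
"doubled_array = values_array * 2\n",
"```\n",
"\n",
"Doubling the values of the list would need a loop or a comprehension, whereas the array does it in a single expression."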
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Im-OAkGQiJnc", "outputId": "2b34d92b-3761-49ec-bc9a-a60080774b9d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RangeIndex(start=0, stop=4, step=1)" ] }, "metadata": {}, "execution_count": 5 } ], "source": [ "# Return the index of the Series.\n", "district_name.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hp7S6BsD279f", "outputId": "3aad52f1-afca-4817-df33-c77c5764d294" }, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(district_name.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "65HQ7NA-8HXc", "outputId": "0520f329-f831-4294-aa21-dfdb58a0971d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 150900\n", "1 310800\n", "2 448600\n", "3 648200\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 6 } ], "source": [ "district_population = pd.Series([150900, 310800, 448600, 648200])\n", "district_population" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "C4CTdE0Z8HXc", "outputId": "87cb7bec-8851-434d-8c84-d49459a74c31" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 9.83\n", "1 136.61\n", "2 129.65\n", "3 68.71\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 7 } ], "source": [ "district_area = pd.Series([9.83, 136.61, 129.65, 68.71])\n", "district_area" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "B7w5X4KR8HXd", "outputId": "cbcd4762-d979-4b6a-f689-6f880a43a70b" }, "outputs": [ { 
"output_type": "execute_result", "data": { "text/plain": [ " District Population Area\n", "0 Wan Chai 150900 9.83\n", "1 North 310800 136.61\n", "2 Sai Kung 448600 129.65\n", "3 Sha Tin 648200 68.71" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DistrictPopulationArea
0Wan Chai1509009.83
1North310800136.61
2Sai Kung448600129.65
3Sha Tin64820068.71
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "HK_district1", "summary": "{\n \"name\": \"HK_district1\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"District\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"North\",\n \"Sha Tin\",\n \"Wan Chai\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 310800,\n 648200,\n 150900\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 136.61,\n 68.71,\n 9.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 8 } ], "source": [ "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html\n", "\n", "HK_district1 = pd.DataFrame({'District': district_name,\n", " 'Population': district_population,\n", " 'Area': district_area})\n", "HK_district1" ] }, { "cell_type": "markdown", "metadata": { "id": "J5p1mL3o8HXe" }, "source": [ "A `Series` can also be created with a user-supplied index.\n", "\n", "Apart from making the data more readable, an explicit index gives the `Series` object additional capabilities such as label-based selection and operation alignment." 
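, "\n",
"\n",
"A minimal sketch of both capabilities, assuming two small `Series` whose labels are deliberately listed in different orders (the names `population`, `area`, and `density` are illustrative only):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"population = pd.Series([448600, 150900], index=['Sai Kung', 'Wan Chai'])\n",
"area = pd.Series([9.83, 129.65], index=['Wan Chai', 'Sai Kung'])\n",
"\n",
"# label-based selection: look a value up by its index label\n",
"wan_chai_population = population['Wan Chai']\n",
"\n",
"# operation alignment: values are paired by label, not by position\n",
"density = population / area\n",
"```\n",
"\n",
"Even though the two `Series` list their labels in different orders, each population is divided by the area that carries the same label."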
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qxBb1vHP8HXe", "outputId": "69f9e3a0-fbf4-442d-bd2b-400b79b7ea7f" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Sai Kung 448600\n", "Sha Tin 648200\n", "Wan Chai 150900\n", "North 310800\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 9 } ], "source": [ "district_population2 = pd.Series([448600, 648200, 150900, 310800],\n", " index=['Sai Kung', 'Sha Tin', 'Wan Chai', 'North'])\n", "district_population2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vSYLyA4_8HXf", "outputId": "9dea60e2-b95d-4179-9af4-17c8900e59b3" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Wan Chai 9.83\n", "North 136.61\n", "Sai Kung 129.65\n", "Sha Tin 68.71\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 10 } ], "source": [ "district_area2 = pd.Series([9.83, 136.61, 129.65, 68.71],\n", " index=['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])\n", "district_area2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "oDgT-R7G8HXg", "outputId": "37fbac32-2599-4768-b69d-8a2df9d94064" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area\n", "North 310800 136.61\n", "Sai Kung 448600 129.65\n", "Sha Tin 648200 68.71\n", "Wan Chai 150900 9.83" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
North310800136.61
Sai Kung448600129.65
Sha Tin64820068.71
Wan Chai1509009.83
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "HK_district2", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 448600,\n 150900,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 129.65,\n 9.83,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 11 } ], "source": [ "HK_district2 = pd.DataFrame({'Population': district_population2, 'Area': district_area2})\n", "HK_district2" ] }, { "cell_type": "markdown", "metadata": { "id": "K_PNxiU08HXg" }, "source": [ "The data from the two `Series` are ***aligned via index labels*** (also sorted in the result).\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0w_xGMOuJfpT", "outputId": "ebd5a526-39c0-4863-83e1-c670dff7aa02" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['District', 'Population', 'Area'], dtype='object')" ] }, "metadata": {}, "execution_count": 12 } ], "source": [ "# Return the column labels of the DataFrame\n", "HK_district1.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "xqTcdEWs8HXh" }, "source": [ "Individual columns of a `DataFrame` can be accessed with dictionary-style indexing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0wZPsdmz8HXh", "outputId": "b64b3f74-b56c-4528-db8f-868a11373a37" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 150900\n", "1 310800\n", "2 448600\n", "3 648200\n", "Name: Population, dtype: int64" ] }, "metadata": {}, "execution_count": 13 } ], "source": [ "HK_district1['Population']" ] }, { "cell_type": "markdown", "metadata": { "id": "zWEFJPSg4AqL" }, "source": [ "They can also be accessed using attribute notation, as if they were attributes of the `DataFrame`. This works only when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute or method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "meGHKttf8HXh", "outputId": "de99d8a3-2d8f-4ee0-b169-07ccce39c930" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "North 136.61\n", "Sai Kung 129.65\n", "Sha Tin 68.71\n", "Wan Chai 9.83\n", "Name: Area, dtype: float64" ] }, "metadata": {}, "execution_count": 14 } ], "source": [ "HK_district2.Area" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1rB6e1_wVhux", "outputId": "070555a2-cab7-42d8-883f-11979108cbd3" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "North 310800\n", "Sai Kung 448600\n", "Sha Tin 648200\n", "Wan Chai 150900\n", "Name: Population, dtype: int64" ] }, "metadata": {}, "execution_count": 15 } ], "source": [ "HK_district2.Population" ] }, { "cell_type": "markdown", "metadata": { "id": "t_RelPLOVQJc" }, "source": [ "Because pandas is built on top of NumPy, `Series` and `DataFrame` objects support **vectorized operations**. Vectorized operations are a powerful feature of NumPy and pandas that apply a function or an operation to all elements of an array, a `Series`, or a `DataFrame` at once, instead of using loops. 
This usually runs much faster than an explicit Python loop and makes your code easier to read." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "msdHWhpMACwm", "outputId": "fc324144-caaa-4e03-d12f-32fa1d508796" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area Density\n", "North 310800 136.61 2275.089671\n", "Sai Kung 448600 129.65 3460.084844\n", "Sha Tin 648200 68.71 9433.852423\n", "Wan Chai 150900 9.83 15350.966429" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationAreaDensity
North310800136.612275.089671
Sai Kung448600129.653460.084844
Sha Tin64820068.719433.852423
Wan Chai1509009.8315350.966429
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "HK_district2", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 448600,\n 150900,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 129.65,\n 9.83,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6025.790769823514,\n \"min\": 2275.0896713271354,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 4,\n \"samples\": [\n 3460.0848438102585,\n 15350.966429298067,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 17 } ], "source": [ "# this assignment form of indexing creates a new column\n", "HK_district2['Density'] = HK_district2.Population / HK_district2.Area\n", "HK_district2" ] }, { "cell_type": "markdown", "metadata": { "id": "XWj8QYIEo3FU" }, "source": [ "There are many ways to create a DataFrame. 
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html\n", "\n", "Another common way to create a `DataFrame` is to call `pd.DataFrame(data, columns=[column_names], index=[row_names])` explicitly.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "h-eiSsbHpH08", "outputId": "55260d3a-e87b-4a3e-c87d-7752ea272105" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "summary": "{\n \"name\": \"df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"a\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 3.0,\n \"max\": 4.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 4.0,\n 3.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"b\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 7.0,\n \"max\": 8.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 7.0,\n 8.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", "type": "dataframe", "variable_name": "df" }, "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
first3.08.0
second4.07.0
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " a b\n", "first 3.0 8.0\n", "second 4.0 7.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(data=[[3.0, 8.0], [4.0, 7.0]], columns=['a', 'b'], index=['first','second'])\n", "df" ] }, { "cell_type": "markdown", "metadata": { "id": "n9aeaNrcdGYS" }, "source": [ "\n", "---\n", "\n", "
\n", "\n", "# 2 Data Selection in `DataFrame`s\n", "\n", "\n", "`DataFrame`s support both ***label-based*** indexing and ***location-based*** indexing.\n", "\n", "Pandas provides two indexer attributes that explicitly expose which indexing scheme to apply:\n", "\n", "- The `loc` attribute uses ***label-based*** indexing:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BA4iaMZrdGYc", "outputId": "bd8ee1c3-bdc5-48bd-c73a-be507095c2a7" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "129.65" ] }, "metadata": {}, "execution_count": 18 } ], "source": [ "HK_district2.loc['Sai Kung', 'Area']" ] }, { "cell_type": "markdown", "metadata": { "id": "5Mc4hyUU7qzY" }, "source": [ "- The `loc` attribute can be used for slicing based on labels. It can handle slices, single labels, and lists of labels. Both the start and the stop of the slice are inclusive." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "DExCstxjdGYf", "outputId": "95aa7d0c-03ba-46a6-d470-82e9bdc27ab2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area\n", "Sai Kung 448600 129.65\n", "Sha Tin 648200 68.71\n", "Wan Chai 150900 9.83" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
Sai Kung448600129.65
Sha Tin64820068.71
Wan Chai1509009.83
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 250257,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 3,\n \"samples\": [\n 448600,\n 648200,\n 150900\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.912951298807954,\n \"min\": 9.83,\n \"max\": 129.65,\n \"num_unique_values\": 3,\n \"samples\": [\n 129.65,\n 68.71,\n 9.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 19 } ], "source": [ "# slicing selects contiguous rows and columns\n", "# unlike ordinary Python slicing, the last label is included\n", "\n", "HK_district2.loc['Sai Kung':'Wan Chai', :'Area']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "stvLLsCBdGYg", "outputId": "d450f23e-ffdb-4883-e656-59f29363cf47" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Density Population\n", "Sai Kung 3460.084844 448600\n", "Wan Chai 15350.966429 150900" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DensityPopulation
Sai Kung3460.084844448600
Wan Chai15350.966429150900
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \" ['Density', 'Population']]\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 8408.123003384675,\n \"min\": 3460.0848438102585,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 2,\n \"samples\": [\n 15350.966429298067,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210505,\n \"min\": 150900,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 150900,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 20 } ], "source": [ "# using indexing lists (or tuples) can select non-contiguous rows and columns\n", "# can also present them in a different order, e.g., make Density precede Population\n", "\n", "HK_district2.loc[['Sai Kung', 'Wan Chai'],\n", " ['Density', 'Population']]" ] }, { "cell_type": "markdown", "metadata": { "id": "p3bklxjn_xY6" }, "source": [ "Boolean indexing selects items that satisfy certain criteria; important for data filtering." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "Q9P1vA4h_0Wt", "outputId": "9ebdd628-fe53-4bcc-d8b2-dd426fad8704" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Density\n", "North 310800 2275.089671\n", "Sha Tin 648200 9433.852423" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationDensity
North3108002275.089671
Sha Tin6482009433.852423
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 238577,\n \"min\": 310800,\n \"max\": 648200,\n \"num_unique_values\": 2,\n \"samples\": [\n 648200,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5062.009686774815,\n \"min\": 2275.0896713271354,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 2,\n \"samples\": [\n 9433.852423228062,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 21 } ], "source": [ "import numpy as np\n", "HK_district2.loc[np.array([True, False, True, False]), ['Population', 'Density']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "g3gB6wGvdGYk", "outputId": "77bd1fe1-30cf-4d9b-9273-e3e4c310cd52" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Density\n", "North 310800 2275.089671\n", "Sai Kung 448600 3460.084844" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationDensity
North3108002275.089671
Sai Kung4486003460.084844
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97439,\n \"min\": 310800,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 448600,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 837.9181221361389,\n \"min\": 2275.0896713271354,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 2,\n \"samples\": [\n 3460.0848438102585,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 23 } ], "source": [ "# different types of indexing (and slicing) can be mixed:\n", "# here a Boolean mask selects the rows and a list of labels selects the columns\n", "HK_district2.loc[HK_district2.Area > 100, ['Population', 'Density']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2_ZI-qRWXyCS", "outputId": "850b5fc6-0dd6-43ab-efc9-0414a4a3229c" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "North True\n", "Sai Kung True\n", "Sha Tin False\n", "Wan Chai False\n", "Name: Area, dtype: bool" ] }, "metadata": {}, "execution_count": 22 } ], "source": [ "HK_district2.Area > 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "id": "dYhhQnjf4AKP", "outputId": "5c1d0050-ee59-4d51-cdef-3d46283773c6" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "summary": "{\n \"name\": \" ['Population', 'Density']]\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 648200,\n \"max\": 648200,\n \"num_unique_values\": 1,\n \"samples\": [\n 648200\n ],\n 
\"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 9433.852423228062,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 1,\n \"samples\": [\n 9433.852423228062\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", "type": "dataframe" }, "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationDensity
Sha Tin6482009433.852423
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "text/plain": [ " Population Density\n", "Sha Tin 648200 9433.852423" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the Boolean operators ~, &, and | are used to combine selection conditions\n", "\n", "HK_district2.loc[~(HK_district2.Area > 100) & (HK_district2.Population > 200000),\n", " ['Population', 'Density']]" ] }, { "cell_type": "markdown", "metadata": { "id": "MShWN8sU5Iei" }, "source": [ "Pandas also provides a handy helper method, `query()`, that lets us filter data with less verbose query strings. It is a powerful tool for filtering `DataFrame` rows using a concise and readable expression syntax.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "jzylxSwi5ZS9", "outputId": "c0aea66c-db32-48fc-b443-7b696b929640" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area Density\n", "Sha Tin 648200 68.71 9433.852423" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationAreaDensity
Sha Tin64820068.719433.852423
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 648200,\n \"max\": 648200,\n \"num_unique_values\": 1,\n \"samples\": [\n 648200\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 68.71,\n \"max\": 68.71,\n \"num_unique_values\": 1,\n \"samples\": [\n 68.71\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 9433.852423228062,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 1,\n \"samples\": [\n 9433.852423228062\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 24 } ], "source": [ "HK_district2.query('~ (Area > 100) & (Population > 200000)')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "id": "7S3OQroFBBko", "outputId": "728ddb31-44a9-42cf-97da-a975c7b685cb" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area Density\n", "Sai Kung 448600 129.65 3460.084844" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationAreaDensity
Sai Kung448600129.653460.084844
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 448600,\n \"max\": 448600,\n \"num_unique_values\": 1,\n \"samples\": [\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 129.65,\n \"max\": 129.65,\n \"num_unique_values\": 1,\n \"samples\": [\n 129.65\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 3460.0848438102585,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 1,\n \"samples\": [\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 25 } ], "source": [ "HK_district2.query('index == \"Sai Kung\"')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "uo9WD616KiVw", "outputId": "cb8072a2-7914-4d63-d41e-22c725cd38c8" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area Density\n", "Sai Kung 448600 129.65 3460.084844\n", "Sha Tin 648200 68.71 9433.852423" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationAreaDensity
Sai Kung448600129.653460.084844
Sha Tin64820068.719433.852423
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 141138,\n \"min\": 448600,\n \"max\": 648200,\n \"num_unique_values\": 2,\n \"samples\": [\n 648200,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.09108724550821,\n \"min\": 68.71,\n \"max\": 129.65,\n \"num_unique_values\": 2,\n \"samples\": [\n 68.71,\n 129.65\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4224.091564638677,\n \"min\": 3460.0848438102585,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 2,\n \"samples\": [\n 9433.852423228062,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 27 } ], "source": [ "# Can take a 1-argument function. The x passed to the lambda is the DataFrame being sliced.\n", "\n", "HK_district2.loc[lambda x: [i[0]=='S' for i in x.index], :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NMrEnfVBBh1c", "outputId": "bef02d96-02bb-45c0-ac39-07c065f0ffea" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['North', 'Sai Kung', 'Sha Tin', 'Wan Chai'], dtype='object')" ] }, "metadata": {}, "execution_count": 26 } ], "source": [ "HK_district2.index" ] }, { "cell_type": "markdown", "metadata": { "id": "9sR1oXZNCZxZ" }, "source": [ "If the second argument (column labels) is omitted, `.loc` will return all columns for the specified rows." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "EpR31NUiCVIJ", "outputId": "5c368e77-16a7-488f-cf04-ca70a85d0408" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area Density\n", "Sai Kung 448600 129.65 3460.084844\n", "Sha Tin 648200 68.71 9433.852423" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationAreaDensity
Sai Kung448600129.653460.084844
Sha Tin64820068.719433.852423
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 141138,\n \"min\": 448600,\n \"max\": 648200,\n \"num_unique_values\": 2,\n \"samples\": [\n 648200,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.09108724550821,\n \"min\": 68.71,\n \"max\": 129.65,\n \"num_unique_values\": 2,\n \"samples\": [\n 68.71,\n 129.65\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4224.091564638677,\n \"min\": 3460.0848438102585,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 2,\n \"samples\": [\n 9433.852423228062,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 28 } ], "source": [ "HK_district2.loc[lambda x: [i[0]=='S' for i in x.index]]" ] }, { "cell_type": "markdown", "metadata": { "id": "K76qH-PcdGYo" }, "source": [ "- The `iloc` attribute uses Python-style ***location-based*** indexing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "XKhgaIyNY9th", "outputId": "03735dfe-5d16-4420-fd39-16e58290e8da" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area Density\n", "North 310800 136.61 2275.089671\n", "Sai Kung 448600 129.65 3460.084844\n", "Sha Tin 648200 68.71 9433.852423\n", "Wan Chai 150900 9.83 15350.966429" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationAreaDensity
North310800136.612275.089671
Sai Kung448600129.653460.084844
Sha Tin64820068.719433.852423
Wan Chai1509009.8315350.966429
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "HK_district2", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 448600,\n 150900,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 129.65,\n 9.83,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6025.790769823514,\n \"min\": 2275.0896713271354,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 4,\n \"samples\": [\n 3460.0848438102585,\n 15350.966429298067,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 29 } ], "source": [ "HK_district2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iI2cWX6xdGYp", "outputId": "328bea5b-fe28-4de2-c0b7-4eeb7cd0139b" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "129.65" ] }, "metadata": {}, "execution_count": 30 } ], "source": [ "# 0-based indexing; starting from zero\n", "HK_district2.iloc[1, 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "hDbkTrEqZISx", "outputId": "38b90700-cc1b-4e21-d9e5-feeae05b61cf" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Area\n", "Sai Kung 448600 129.65\n", "Sha Tin 648200 68.71\n", "Wan Chai 150900 9.83" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationArea
Sai Kung448600129.65
Sha Tin64820068.71
Wan Chai1509009.83
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 250257,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 3,\n \"samples\": [\n 448600,\n 648200,\n 150900\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.912951298807954,\n \"min\": 9.83,\n \"max\": 129.65,\n \"num_unique_values\": 3,\n \"samples\": [\n 129.65,\n 68.71,\n 9.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 31 } ], "source": [ "# the last index is exclusive as with regular Python slicing\n", "HK_district2.iloc[1:4, :2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "ymCsTpxxaD_l", "outputId": "4f87e9ed-06df-4de8-b5cb-1f82a5203b38" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Density Population\n", "Sai Kung 3460.084844 448600\n", "Wan Chai 15350.966429 150900" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DensityPopulation
Sai Kung3460.084844448600
Wan Chai15350.966429150900
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 8408.123003384675,\n \"min\": 3460.0848438102585,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 2,\n \"samples\": [\n 15350.966429298067,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210505,\n \"min\": 150900,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 150900,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 32 } ], "source": [ "# select non-contiguous rows and columns\n", "HK_district2.iloc[[1, 3], [2, 0]]" ] }, { "cell_type": "markdown", "metadata": { "id": "PpWB3RFKEE0W" }, "source": [ "The `.iloc` indexer is primarily integer-position based (from 0 to length-1 of the axis), but it also accepts a boolean array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "wuPmilQLdGYy", "outputId": "c7398335-05e0-46e1-e226-46d61252bde0", "scrolled": true }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Density\n", "North 310800 2275.089671\n", "Sai Kung 448600 3460.084844" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationDensity
North3108002275.089671
Sai Kung4486003460.084844
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97439,\n \"min\": 310800,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 448600,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 837.9181221361389,\n \"min\": 2275.0896713271354,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 2,\n \"samples\": [\n 3460.0848438102585,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 34 } ], "source": [ "# HK_district2.Area > 100 returns a boolean Series\n", "# iloc does not accept a boolean Series as a mask, so convert it to a NumPy array via .values\n", "HK_district2.iloc[(HK_district2.Area > 100).values, [0, 2]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Gr7Yk_SHNh0h", "outputId": "65c0428b-0cbd-47fb-c9cc-056fba605057" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([ True, True, False, False])" ] }, "metadata": {}, "execution_count": 33 } ], "source": [ "(HK_district2.Area > 100).values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "WmyrwCr5Nmaj", "outputId": "5c7eb563-c70c-453b-8ade-2d52b028a6e7" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population Density\n", "North 310800 2275.089671\n", "Sai Kung 448600 3460.084844" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PopulationDensity
North3108002275.089671
Sai Kung4486003460.084844
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97439,\n \"min\": 310800,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 448600,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 837.9181221361389,\n \"min\": 2275.0896713271354,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 2,\n \"samples\": [\n 3460.0848438102585,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 35 } ], "source": [ "HK_district2.iloc[(HK_district2.Area > 100).values, [True, False, True]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "PyjoPDkNIkZj", "outputId": "d41fa286-2138-4993-9711-bd421e8d957c" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Area\n", "North 136.61\n", "Sha Tin 68.71" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Area
North136.61
Sha Tin68.71
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \" lambda x: [i[0]=='A' for i in x\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 48.01255044256659,\n \"min\": 68.71,\n \"max\": 136.61,\n \"num_unique_values\": 2,\n \"samples\": [\n 68.71,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 36 } ], "source": [ "# Can take a 1-argument function. The x passed to the lambda is the DataFrame being sliced.\n", "\n", "HK_district2.iloc[lambda x: [i for i in range(len(x)) if i % 2 == 0],\n", " lambda x: [i[0]=='A' for i in x.columns]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-Z5fakGdEWN6", "outputId": "9b72ed50-3ce6-473b-9643-1a2bae31369e" }, "outputs": [ { "data": { "text/plain": [ "range(0, 4)" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "range(len(HK_district2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e0p_b4FFEiaG", "outputId": "1fd23d16-d543-4eb7-aa6f-6c1e6843d597" }, "outputs": [ { "data": { "text/plain": [ "Index(['Population', 'Area', 'Density'], dtype='object')" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HK_district2.columns" ] }, { "cell_type": "markdown", "metadata": { "id": "2TQdkZzoqI8Y" }, "source": [ "*Exercise*:\n", "Can you select district(s) whose name is shorter than 6 characters and show their `Population`?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jHOnVubqsDTn", "colab": { "base_uri": "https://localhost:8080/", "height": 80 }, "outputId": "f8a50661-4fac-48e5-8e3a-c4da479b8964" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Population\n", "North 310800" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Population
North310800
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 310800,\n \"max\": 310800,\n \"num_unique_values\": 1,\n \"samples\": [\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 37 } ], "source": [ "# write your code here\n", "\n", "HK_district2.iloc[lambda x: [len(i)<6 for i in x.index],[0]]" ] }, { "cell_type": "markdown", "metadata": { "id": "gN_jwMl-dGZy" }, "source": [ "---\n", "\n", "
\n", "\n", "# 3 Importing and Exporting Data\n", "\n", "Pandas features a number of [functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for reading tabular data as a `DataFrame` object. Among them, [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is likely the one we'll use the most:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "UTYNJD03Yb0b", "outputId": "4128a707-1d22-4c27-a37c-836760f98730" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Date GOOG APPL AMZN\n", "0 2015/5/1 537.900024 120.220688 422.869995\n", "1 2015/5/4 540.780029 119.987633 423.040009\n", "2 2015/5/5 530.799988 117.283951 421.190002\n", "3 2015/5/6 524.219971 116.547424 419.100006" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateGOOGAPPLAMZN
02015/5/1537.900024120.220688422.869995
12015/5/4540.780029119.987633423.040009
22015/5/5530.799988117.283951421.190002
32015/5/6524.219971116.547424419.100006
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "sp", "summary": "{\n \"name\": \"sp\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"2015/5/4\",\n \"2015/5/6\",\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"GOOG\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 7.432934330450454,\n \"min\": 524.219971,\n \"max\": 540.780029,\n \"num_unique_values\": 4,\n \"samples\": [\n 540.780029,\n 524.219971,\n 537.900024\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.8676860370617223,\n \"min\": 116.547424,\n \"max\": 120.220688,\n \"num_unique_values\": 4,\n \"samples\": [\n 119.987633,\n 116.547424,\n 120.220688\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.834355725235249,\n \"min\": 419.100006,\n \"max\": 423.040009,\n \"num_unique_values\": 4,\n \"samples\": [\n 423.040009,\n 419.100006,\n 422.869995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 38 } ], "source": [ "sp = pd.read_csv(\"https://raw.githubusercontent.com/justinjiajia/datafiles/master/adj_closing_sub.csv\")\n", "sp" ] }, { "cell_type": "markdown", "metadata": { "id": "BPXlruYGdGZ2" }, "source": [ "The corresponding writer functions are object methods that are accessed like [`DataFrame.to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html):\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hmt1EnUJdGZ5" }, "outputs": [], "source": [ "sp.to_csv(\"stockprice_new.csv\")" ] }, { "cell_type": "markdown", "metadata": { "id": "v38QUl8udGZ8" }, 
"source": [ "\n", "---\n", "\n", "
\n", "\n", "# 4 Computing Summary and Descriptive Statistics\n", "\n", "\n", "`DataFrame` objects are equipped with common mathematical and statistical [methods](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats) for column-wise computations (or row-wise by setting `axis=1`):\n", "\n", "- Most of them produce aggregates:\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ke5EX-BYdGZ9", "outputId": "86254c4e-b11b-4da9-cfe0-63ee26dee4c6" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "GOOG 533.425003\n", "APPL 118.509924\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 39 } ], "source": [ "sp[['GOOG', 'APPL']].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nBb16pzt8mX3", "outputId": "a37118d5-d770-4007-8801-b4a4b0acb28f" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Date 4\n", "APPL 4\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 40 } ], "source": [ "sp[['Date', 'APPL']].nunique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pgedhvgAZjhs", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c383a0fe-8a75-4381-d2ad-61462e03a8cb" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['AMZN',\n", " 'APPL',\n", " 'Date',\n", " 'GOOG',\n", " 'T',\n", " '_AXIS_LEN',\n", " '_AXIS_ORDERS',\n", " '_AXIS_TO_AXIS_NUMBER',\n", " '_HANDLED_TYPES',\n", " '__abs__',\n", " '__add__',\n", " '__and__',\n", " '__annotations__',\n", " '__array__',\n", " '__array_priority__',\n", " '__array_ufunc__',\n", " '__bool__',\n", " '__class__',\n", " '__contains__',\n", " '__copy__',\n", " '__dataframe__',\n", " '__deepcopy__',\n", " '__delattr__',\n", " '__delitem__',\n", " '__dict__',\n", " '__dir__',\n", " 
'__divmod__',\n", " '__doc__',\n", " '__eq__',\n", " '__finalize__',\n", " '__floordiv__',\n", " '__format__',\n", " '__ge__',\n", " '__getattr__',\n", " '__getattribute__',\n", " '__getitem__',\n", " '__getstate__',\n", " '__gt__',\n", " '__hash__',\n", " '__iadd__',\n", " '__iand__',\n", " '__ifloordiv__',\n", " '__imod__',\n", " '__imul__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__invert__',\n", " '__ior__',\n", " '__ipow__',\n", " '__isub__',\n", " '__iter__',\n", " '__itruediv__',\n", " '__ixor__',\n", " '__le__',\n", " '__len__',\n", " '__lt__',\n", " '__matmul__',\n", " '__mod__',\n", " '__module__',\n", " '__mul__',\n", " '__ne__',\n", " '__neg__',\n", " '__new__',\n", " '__nonzero__',\n", " '__or__',\n", " '__pos__',\n", " '__pow__',\n", " '__radd__',\n", " '__rand__',\n", " '__rdivmod__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__rfloordiv__',\n", " '__rmatmul__',\n", " '__rmod__',\n", " '__rmul__',\n", " '__ror__',\n", " '__round__',\n", " '__rpow__',\n", " '__rsub__',\n", " '__rtruediv__',\n", " '__rxor__',\n", " '__setattr__',\n", " '__setitem__',\n", " '__setstate__',\n", " '__sizeof__',\n", " '__str__',\n", " '__sub__',\n", " '__subclasshook__',\n", " '__truediv__',\n", " '__weakref__',\n", " '__xor__',\n", " '_accessors',\n", " '_accum_func',\n", " '_add_numeric_operations',\n", " '_agg_examples_doc',\n", " '_agg_summary_and_see_also_doc',\n", " '_align_frame',\n", " '_align_series',\n", " '_append',\n", " '_arith_method',\n", " '_as_manager',\n", " '_attrs',\n", " '_box_col_values',\n", " '_can_fast_transpose',\n", " '_check_inplace_and_allows_duplicate_labels',\n", " '_check_inplace_setting',\n", " '_check_is_chained_assignment_possible',\n", " '_check_label_or_level_ambiguity',\n", " '_check_setitem_copy',\n", " '_clear_item_cache',\n", " '_clip_with_one_bound',\n", " '_clip_with_scalar',\n", " '_cmp_method',\n", " '_combine_frame',\n", " '_consolidate',\n", " '_consolidate_inplace',\n", " 
'_construct_axes_dict',\n", " '_construct_result',\n", " '_constructor',\n", " '_constructor_sliced',\n", " '_create_data_for_split_and_tight_to_dict',\n", " '_data',\n", " '_dir_additions',\n", " '_dir_deletions',\n", " '_dispatch_frame_op',\n", " '_drop_axis',\n", " '_drop_labels_or_levels',\n", " '_ensure_valid_index',\n", " '_find_valid_index',\n", " '_flags',\n", " '_from_arrays',\n", " '_get_agg_axis',\n", " '_get_axis',\n", " '_get_axis_name',\n", " '_get_axis_number',\n", " '_get_axis_resolvers',\n", " '_get_block_manager_axis',\n", " '_get_bool_data',\n", " '_get_cleaned_column_resolvers',\n", " '_get_column_array',\n", " '_get_index_resolvers',\n", " '_get_item_cache',\n", " '_get_label_or_level_values',\n", " '_get_numeric_data',\n", " '_get_value',\n", " '_getitem_bool_array',\n", " '_getitem_multilevel',\n", " '_getitem_nocopy',\n", " '_gotitem',\n", " '_hidden_attrs',\n", " '_indexed_same',\n", " '_info_axis',\n", " '_info_axis_name',\n", " '_info_axis_number',\n", " '_info_repr',\n", " '_init_mgr',\n", " '_inplace_method',\n", " '_internal_names',\n", " '_internal_names_set',\n", " '_is_copy',\n", " '_is_homogeneous_type',\n", " '_is_label_or_level_reference',\n", " '_is_label_reference',\n", " '_is_level_reference',\n", " '_is_mixed_type',\n", " '_is_view',\n", " '_iset_item',\n", " '_iset_item_mgr',\n", " '_iset_not_inplace',\n", " '_item_cache',\n", " '_iter_column_arrays',\n", " '_ixs',\n", " '_join_compat',\n", " '_logical_func',\n", " '_logical_method',\n", " '_maybe_cache_changed',\n", " '_maybe_update_cacher',\n", " '_metadata',\n", " '_mgr',\n", " '_min_count_stat_function',\n", " '_needs_reindex_multi',\n", " '_protect_consolidate',\n", " '_reduce',\n", " '_reduce_axis1',\n", " '_reindex_axes',\n", " '_reindex_columns',\n", " '_reindex_index',\n", " '_reindex_multi',\n", " '_reindex_with_indexers',\n", " '_rename',\n", " '_replace_columnwise',\n", " '_repr_data_resource_',\n", " '_repr_fits_horizontal_',\n", " '_repr_fits_vertical_',\n", " 
'_repr_html_',\n", " '_repr_latex_',\n", " '_reset_cache',\n", " '_reset_cacher',\n", " '_sanitize_column',\n", " '_series',\n", " '_set_axis',\n", " '_set_axis_name',\n", " '_set_axis_nocheck',\n", " '_set_is_copy',\n", " '_set_item',\n", " '_set_item_frame_value',\n", " '_set_item_mgr',\n", " '_set_value',\n", " '_setitem_array',\n", " '_setitem_frame',\n", " '_setitem_slice',\n", " '_slice',\n", " '_stat_axis',\n", " '_stat_axis_name',\n", " '_stat_axis_number',\n", " '_stat_function',\n", " '_stat_function_ddof',\n", " '_take',\n", " '_take_with_is_copy',\n", " '_to_dict_of_blocks',\n", " '_to_latex_via_styler',\n", " '_typ',\n", " '_update_inplace',\n", " '_validate_dtype',\n", " '_values',\n", " '_where',\n", " 'abs',\n", " 'add',\n", " 'add_prefix',\n", " 'add_suffix',\n", " 'agg',\n", " 'aggregate',\n", " 'align',\n", " 'all',\n", " 'any',\n", " 'apply',\n", " 'applymap',\n", " 'asfreq',\n", " 'asof',\n", " 'assign',\n", " 'astype',\n", " 'at',\n", " 'at_time',\n", " 'attrs',\n", " 'axes',\n", " 'backfill',\n", " 'between_time',\n", " 'bfill',\n", " 'bool',\n", " 'boxplot',\n", " 'clip',\n", " 'columns',\n", " 'combine',\n", " 'combine_first',\n", " 'compare',\n", " 'convert_dtypes',\n", " 'copy',\n", " 'corr',\n", " 'corrwith',\n", " 'count',\n", " 'cov',\n", " 'cummax',\n", " 'cummin',\n", " 'cumprod',\n", " 'cumsum',\n", " 'describe',\n", " 'diff',\n", " 'div',\n", " 'divide',\n", " 'dot',\n", " 'drop',\n", " 'drop_duplicates',\n", " 'droplevel',\n", " 'dropna',\n", " 'dtypes',\n", " 'duplicated',\n", " 'empty',\n", " 'eq',\n", " 'equals',\n", " 'eval',\n", " 'ewm',\n", " 'expanding',\n", " 'explode',\n", " 'ffill',\n", " 'fillna',\n", " 'filter',\n", " 'first',\n", " 'first_valid_index',\n", " 'flags',\n", " 'floordiv',\n", " 'from_dict',\n", " 'from_records',\n", " 'ge',\n", " 'get',\n", " 'groupby',\n", " 'gt',\n", " 'head',\n", " 'hist',\n", " 'iat',\n", " 'idxmax',\n", " 'idxmin',\n", " 'iloc',\n", " 'index',\n", " 'infer_objects',\n", " 'info',\n", 
" 'insert',\n", " 'interpolate',\n", " 'isetitem',\n", " 'isin',\n", " 'isna',\n", " 'isnull',\n", " 'items',\n", " 'iterrows',\n", " 'itertuples',\n", " 'join',\n", " 'keys',\n", " 'kurt',\n", " 'kurtosis',\n", " 'last',\n", " 'last_valid_index',\n", " 'le',\n", " 'loc',\n", " 'lt',\n", " 'mask',\n", " 'max',\n", " 'mean',\n", " 'median',\n", " 'melt',\n", " 'memory_usage',\n", " 'merge',\n", " 'min',\n", " 'mod',\n", " 'mode',\n", " 'mul',\n", " 'multiply',\n", " 'ndim',\n", " 'ne',\n", " 'nlargest',\n", " 'notna',\n", " 'notnull',\n", " 'nsmallest',\n", " 'nunique',\n", " 'pad',\n", " 'pct_change',\n", " 'pipe',\n", " 'pivot',\n", " 'pivot_table',\n", " 'plot',\n", " 'pop',\n", " 'pow',\n", " 'prod',\n", " 'product',\n", " 'quantile',\n", " 'query',\n", " 'radd',\n", " 'rank',\n", " 'rdiv',\n", " 'reindex',\n", " 'reindex_like',\n", " 'rename',\n", " 'rename_axis',\n", " 'reorder_levels',\n", " 'replace',\n", " 'resample',\n", " 'reset_index',\n", " 'rfloordiv',\n", " 'rmod',\n", " 'rmul',\n", " 'rolling',\n", " 'round',\n", " 'rpow',\n", " 'rsub',\n", " 'rtruediv',\n", " 'sample',\n", " 'select_dtypes',\n", " 'sem',\n", " 'set_axis',\n", " 'set_flags',\n", " 'set_geometry',\n", " 'set_index',\n", " 'shape',\n", " 'shift',\n", " 'size',\n", " 'skew',\n", " 'sort_index',\n", " 'sort_values',\n", " 'squeeze',\n", " 'stack',\n", " 'std',\n", " 'style',\n", " 'sub',\n", " 'subtract',\n", " 'sum',\n", " 'swapaxes',\n", " 'swaplevel',\n", " 'tail',\n", " 'take',\n", " 'to_clipboard',\n", " 'to_csv',\n", " 'to_dict',\n", " 'to_excel',\n", " 'to_feather',\n", " 'to_gbq',\n", " 'to_hdf',\n", " 'to_html',\n", " 'to_json',\n", " 'to_latex',\n", " 'to_markdown',\n", " 'to_numpy',\n", " 'to_orc',\n", " 'to_parquet',\n", " 'to_period',\n", " 'to_pickle',\n", " 'to_records',\n", " 'to_sql',\n", " 'to_stata',\n", " 'to_string',\n", " 'to_timestamp',\n", " 'to_xarray',\n", " 'to_xml',\n", " 'transform',\n", " 'transpose',\n", " 'truediv',\n", " 'truncate',\n", " 
'tz_convert',\n", " 'tz_localize',\n", " 'unstack',\n", " 'update',\n", " 'value_counts',\n", " 'values',\n", " 'var',\n", " 'where',\n", " 'xs']" ] }, "metadata": {}, "execution_count": 41 } ], "source": [ "dir(sp)" ] }, { "cell_type": "markdown", "metadata": { "id": "NUufnYU-8mX4" }, "source": [ "The following table summarizes some built-in Pandas aggregations:\n", "\n", "| Aggregation | Description |\n", "|--------------------------|---------------------------------|\n", "| ``count()`` | Number of non-missing items |\n", "| ``nunique()`` | Number of distinct items |\n", "| ``mean()``, ``median()`` | Mean and median |\n", "| ``min()``, ``max()`` | Minimum and maximum |\n", "| ``std()``, ``var()`` | Standard deviation and variance |\n", "| ``mad()`` | Mean absolute deviation (removed in pandas 2.0) |\n", "| ``prod()`` | Product of all items |\n", "| ``sum()`` | Sum of all items |\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ZpV-jgcddGZ_" }, "source": [ "- Some statistics are computed from pairs of columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "VBaSjSmDdGaC", "outputId": "d9b5f4e0-51ac-459b-fea8-bc8812a26bd2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " GOOG APPL AMZN\n", "GOOG 55.248513 13.269122 13.454445\n", "APPL 13.269122 3.488251 3.236487\n", "AMZN 13.454445 3.236487 3.364861" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GOOGAPPLAMZN
GOOG55.24851313.26912213.454445
APPL13.2691223.4882513.236487
AMZN13.4544453.2364873.364861
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"sp\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"GOOG\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 24.18349209326099,\n \"min\": 13.269121918691017,\n \"max\": 55.24851276078893,\n \"num_unique_values\": 3,\n \"samples\": [\n 55.24851276078893,\n 13.269121918691017,\n 13.454444533302368\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.721051535790905,\n \"min\": 3.236486896205,\n \"max\": 13.269121918691017,\n \"num_unique_values\": 3,\n \"samples\": [\n 13.269121918691017,\n 3.48825113303532,\n 3.236486896205\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.862633587955939,\n \"min\": 3.236486896205,\n \"max\": 13.454444533302368,\n \"num_unique_values\": 3,\n \"samples\": [\n 13.454444533302368,\n 3.236486896205,\n 3.364860926703335\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 42 } ], "source": [ "sp.cov(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "K2lEOOAW8mX5", "outputId": "b0e15c51-1915-4cb9-e2da-e1c65d661f9f" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " APPL AMZN\n", "APPL 3.488251 3.236487\n", "AMZN 3.236487 3.364861" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
APPLAMZN
APPL3.4882513.236487
AMZN3.2364873.364861
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"sp[['APPL', 'AMZN']]\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.17802419912297515,\n \"min\": 3.236486896205,\n \"max\": 3.48825113303532,\n \"num_unique_values\": 2,\n \"samples\": [\n 3.236486896205,\n 3.48825113303532\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.09077414749362132,\n \"min\": 3.236486896205,\n \"max\": 3.364860926703335,\n \"num_unique_values\": 2,\n \"samples\": [\n 3.364860926703335,\n 3.236486896205\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 43 } ], "source": [ "sp[['APPL', 'AMZN']].cov()" ] }, { "cell_type": "markdown", "metadata": { "id": "xo3NLs1QdGaD" }, "source": [ "- Some produce multiple summary statistics in one shot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "bR30nxwSdGaE", "outputId": "5dce12b4-bd69-4e0d-c724-e41a4dc12ad0" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " GOOG APPL AMZN\n", "count 4.000000 4.000000 4.000000\n", "mean 533.425003 118.509924 421.550003\n", "std 7.432934 1.867686 1.834356\n", "min 524.219971 116.547424 419.100006\n", "25% 529.154984 117.099819 420.667503\n", "50% 534.350006 118.635792 422.029999\n", "75% 538.620025 120.045897 422.912499\n", "max 540.780029 120.220688 423.040009" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GOOGAPPLAMZN
count4.0000004.0000004.000000
mean533.425003118.509924421.550003
std7.4329341.8676861.834356
min524.219971116.547424419.100006
25%529.154984117.099819420.667503
50%534.350006118.635792422.029999
75%538.620025120.045897422.912499
max540.780029120.220688423.040009
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"sp\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"GOOG\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 244.33736825518412,\n \"min\": 4.0,\n \"max\": 540.780029,\n \"num_unique_values\": 8,\n \"samples\": [\n 533.4250030000001,\n 534.350006,\n 4.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 53.51923645373125,\n \"min\": 1.8676860370617223,\n \"max\": 120.220688,\n \"num_unique_values\": 8,\n \"samples\": [\n 118.50992400000001,\n 118.63579200000001,\n 4.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 193.79429323398534,\n \"min\": 1.834355725235249,\n \"max\": 423.040009,\n \"num_unique_values\": 8,\n \"samples\": [\n 421.55000300000006,\n 422.02999850000003,\n 4.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 89 } ], "source": [ "# by default, summarize numeric columns only\n", "sp.describe()" ] }, { "cell_type": "markdown", "source": [ "Passing `include=['O']` restricts the output of the `describe()` method to columns with the `object` data type. Here, 'O' stands for object, which in pandas typically holds strings or mixed data types." ], "metadata": { "id": "SjBU1-g8HJuZ" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "YUv2dmLK8mX8", "outputId": "e322454b-bf14-4122-c9ee-006f331043b6" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Date\n", "count 4\n", "unique 4\n", "top 2015/5/1\n", "freq 1" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Date
count4
unique4
top2015/5/1
freq1
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"sp\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"date\",\n \"min\": \"1970-01-01 00:00:00.000000001\",\n \"max\": \"2015-05-01 00:00:00\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"4\",\n \"2015/5/1\",\n \"1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 90 } ], "source": [ "# Python object columns can be selected using include=['O'].\n", "sp.describe(include=['O'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ztubCDQ-8mX9", "outputId": "9a50f269-caad-49fb-8a63-7bac2c4fb7a8" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "RangeIndex: 4 entries, 0 to 3\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Date 4 non-null object \n", " 1 GOOG 4 non-null float64\n", " 2 APPL 4 non-null float64\n", " 3 AMZN 4 non-null float64\n", "dtypes: float64(3), object(1)\n", "memory usage: 256.0+ bytes\n" ] } ], "source": [ "sp.info()" ] }, { "cell_type": "markdown", "metadata": { "id": "mtZzbfB8gKQu" }, "source": [ "A `Series` object's `value_counts()` method returns the frequency of each distinct value it contains:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-8AcPogegr8y", "outputId": "5f756067-11c9-4b06-9d15-cdfaf2afcde6" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Date\n", "2015/5/1 1\n", "2015/5/4 1\n", "2015/5/5 1\n", "2015/5/6 1\n", "Name: count, dtype: int64" ] }, "metadata": {}, "execution_count": 93 } ], "source": [ "sp['Date'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "_SRhspgjrzdg" }, "source": [ "*Exercise*:\n", "Can you 
calculate the standard deviation of stocks in `sp`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "aGddi9_HsACN", "outputId": "29eb1e7c-7ff4-46a0-f83e-cdb2d16fbe9a" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "GOOG 7.432934\n", "APPL 1.867686\n", "AMZN 1.834356\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 94 } ], "source": [ "# write your code here\n", "sp[['GOOG', 'APPL', 'AMZN']].std()" ] }, { "cell_type": "markdown", "metadata": { "id": "LJMVU4nU8mW5" }, "source": [ "---\n", "\n", "
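To recap the summary statistics covered so far, here is a minimal sketch that computes several aggregations in one call. The `prices` frame is a hypothetical stand-in for `sp` (values rounded, one stock omitted), and the mean absolute deviation is computed by hand on the assumption that `mad()` is unavailable, as is the case from pandas 2.0 onward.

```python
import pandas as pd

# Hypothetical stand-in for the sp table used above (values rounded).
prices = pd.DataFrame({
    'Date': ['2015/5/1', '2015/5/4', '2015/5/5', '2015/5/6'],
    'GOOG': [537.90, 540.78, 530.80, 524.22],
    'AMZN': [422.87, 423.04, 421.19, 419.10],
})

# Keep only the numeric columns so the Date strings are ignored.
numeric = prices.select_dtypes('number')

# Several aggregations at once: one row per statistic, one column per stock.
summary = numeric.agg(['mean', 'std', 'min', 'max'])

# Mean absolute deviation, computed by hand instead of calling mad().
mean_abs_dev = (numeric - numeric.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```

Passing a list of names to `agg()` keeps the result in a single table, which is often easier to scan than calling each aggregation separately.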
\n", "\n", "# 5 Handling Missing Values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "PMCPChKl8mW5", "outputId": "84d81f35-4918-4b20-9a23-99cdb3920089" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 1.0 5.0 4\n", "1 2.0 NaN 5\n", "2 NaN NaN 6" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.05.04
12.0NaN5
2NaNNaN6
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_w_nan", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 95 } ], "source": [ "import numpy as np\n", "\n", "df_w_nan = pd.DataFrame({'A': [1, 2, np.nan],\n", " 'B': [5, np.nan, np.nan],\n", " 'C': [4, 5, 6]})\n", "df_w_nan" ] }, { "cell_type": "markdown", "metadata": { "id": "WxrkQ2wKdGZH" }, "source": [ "\n", "Pandas provides several useful methods for detecting, removing, and replacing missing values in pandas data structures:\n", "\n", "- [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) generates a boolean mask indicating missing values, while [`notnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html) produces the opposite:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "kYRcEYES8mW9", "outputId": "0ce23f0a-1a21-4faf-d5fe-8d3cd5cca75d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 False False False\n", "1 False True False\n", "2 True True False" ], 
"text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
0FalseFalseFalse
1FalseTrueFalse
2TrueTrueFalse
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 96 } ], "source": [ "df_w_nan.isnull()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xkXQSA9K8mW_", "outputId": "95289e32-344c-4f24-be16-e600b50e926f" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 True\n", "1 True\n", "2 False\n", "Name: A, dtype: bool" ] }, "metadata": {}, "execution_count": 97 } ], "source": [ "df_w_nan.A.notnull()" ] }, { "cell_type": "markdown", "metadata": { "id": "JHC_6HHQ8mXB" }, "source": [ "\n", "- [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) returns a filtered version of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "id": "FpDN3MRx8mXC", "outputId": "42add103-652f-4b08-e213-4c398dc2af3a" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 1.0 5.0 4" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.05.04
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 1.0,\n \"max\": 1.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 4,\n \"max\": 4,\n \"num_unique_values\": 1,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 98 } ], "source": [ "df_w_nan.dropna(axis=0) # axis=0 Drop rows which contain missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "H6_ulSI-8mXD", "outputId": "467f88cf-36bf-484a-d8f3-74b32990793b" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " C\n", "0 4\n", "1 5\n", "2 6" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
C
04
15
26
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4,\n 5,\n 6\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 99 } ], "source": [ "df_w_nan.dropna(axis=1) # axis=1 Drop columns which contain missing values." ] }, { "cell_type": "markdown", "metadata": { "id": "yivKqJG-8mXG" }, "source": [ "\n", "- [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) returns a copy of the data with missing values filled or imputed (set `inplace=True` to modify it in place):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "lmVV7bpl8mXI", "outputId": "efbf5492-41f9-43c1-d24f-d4088ca1226b" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 1.0 5.0 4\n", "1 2.0 NaN 5\n", "2 NaN NaN 6" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.05.04
12.0NaN5
2NaNNaN6
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_w_nan", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 100 } ], "source": [ "df_w_nan" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "YHN9QNRX8mXK", "outputId": "1f7e1037-257f-44c4-f982-3f6d296a93f9" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 1.0 5.0 4\n", "1 2.0 0.0 5\n", "2 0.0 0.0 6" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.05.04
12.00.05
20.00.06
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0,\n \"min\": 0.0,\n \"max\": 2.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 1.0,\n 2.0,\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.8867513459481287,\n \"min\": 0.0,\n \"max\": 5.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 0.0,\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 102 } ], "source": [ "# Replace all NaN elements with 0s.\n", "df_w_nan.fillna(0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "CPDbf5qv8mXL", "outputId": "26f6ff4d-40c5-4f3b-ae0e-04f03eeabda2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 1.0 5.0 4\n", "1 2.0 5.0 5\n", "2 2.0 5.0 6" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.05.04
12.05.05
22.05.06
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5773502691896257,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 103 } ], "source": [ "# Forward fill: replace each missing value with the value from the previous row (axis=0) or the previous column (axis=1).\n", "# ffill() is used here because fillna(method='ffill') is deprecated as of pandas 2.1.\n", "df_w_nan.ffill(axis=0)" ] }, { "cell_type": "markdown", "metadata": { "id": "lmZwc_wpt_pj" }, "source": [ "A more flexible way to fill or impute values (in place) is to use the assignment form of indexing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "y4WYqzg-mIjX", "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "outputId": "825c1e82-6723-4e97-b1ef-dbdec91719cd" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C\n", "0 1.0 5.0 4\n", "1 2.0 5.0 5\n", "2 NaN 5.0 6" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
01.05.04
12.05.05
2NaN5.06
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_w_nan", "summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 104 } ], "source": [ "df_w_nan.loc[df_w_nan.B.isnull(), 'B'] = df_w_nan.B.mean()\n", "df_w_nan" ] }, { "cell_type": "markdown", "metadata": { "id": "62sWIkFldGap" }, "source": [ "---\n", "\n", "
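The detection, removal, and imputation steps above can be combined in a short workflow. The sketch below is only illustrative: it rebuilds the same toy frame under the hypothetical name `df`, and the choice of mean imputation and of `thresh=2` are assumptions made for the example, not recommendations.

```python
import numpy as np
import pandas as pd

# Same toy frame as df_w_nan above, rebuilt so the sketch is self-contained.
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [4, 5, 6]})

# Count missing values per column.
n_missing = df.isna().sum()

# Impute each column with its own mean; df.mean() skips NaN by default.
filled = df.fillna(df.mean())

# Keep only rows that have at least two non-missing values.
mostly_complete = df.dropna(thresh=2)

print(n_missing)
print(filled)
print(mostly_complete)
```

Passing a `Series` such as `df.mean()` to `fillna()` fills each column with its own value, whereas a scalar such as `0` applies the same value everywhere.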
\n", "\n", "# 6 Computing Group-wise Summary Statistics\n", "\n", "\n", "\n", "Categorizing a dataset and applying a function to each group (whether an aggregation or a transformation) is often a critical component of a data analysis workflow. \n", "\n", "\n", "\n", "\n", "\n", "Data in a `DataFrame` can be split into groups by calling the `DataFrame`'s [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method and passing the name of the desired key column (or a list of names to group by several columns):\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 425 }, "id": "ji2znJdbdGa2", "outputId": "b0adf8e6-c883-4797-f2f4-4c264ba84695" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Date Symbol Price Volume\n", "0 2015/5/1 GOOG 537.900024 1768200\n", "1 2015/5/4 GOOG 540.780029 1308000\n", "2 2015/5/5 GOOG 530.799988 1383100\n", "3 2015/5/6 GOOG 524.219971 1567000\n", "4 2015/5/1 APPL 120.220688 58512600\n", "5 2015/5/4 APPL 119.987633 50988300\n", "6 2015/5/5 APPL 117.283951 49271400\n", "7 2015/5/6 APPL 116.547424 72141000\n", "8 2015/5/1 AMZN 422.869995 3565800\n", "9 2015/5/4 AMZN 423.040009 2270400\n", "10 2015/5/5 AMZN 421.190002 2856400\n", "11 2015/5/6 AMZN 419.100006 2552500" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateSymbolPriceVolume
02015/5/1GOOG537.9000241768200
12015/5/4GOOG540.7800291308000
22015/5/5GOOG530.7999881383100
32015/5/6GOOG524.2199711567000
42015/5/1APPL120.22068858512600
52015/5/4APPL119.98763350988300
62015/5/5APPL117.28395149271400
72015/5/6APPL116.54742472141000
82015/5/1AMZN422.8699953565800
92015/5/4AMZN423.0400092270400
102015/5/5AMZN421.1900022856400
112015/5/6AMZN419.1000062552500
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "stock", "summary": "{\n \"name\": \"stock\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"2015/5/4\",\n \"2015/5/6\",\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"GOOG\",\n \"APPL\",\n \"AMZN\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 183.11895671451376,\n \"min\": 116.547424,\n \"max\": 540.780029,\n \"num_unique_values\": 12,\n \"samples\": [\n 421.190002,\n 423.040009,\n 537.900024\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27902924,\n \"min\": 1308000,\n \"max\": 72141000,\n \"num_unique_values\": 12,\n \"samples\": [\n 2856400,\n 2270400,\n 1768200\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 44 } ], "source": [ "#! 
wget -q -O stock.csv \"https://raw.githubusercontent.com/justinjiajia/datafiles/main/pricevolume_sub.csv\"\n", "#stock = pd.read_csv(\"stock.csv\")\n", "stock = pd.read_csv(\"https://raw.githubusercontent.com/justinjiajia/datafiles/main/pricevolume_sub.csv\")\n", "stock" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pu1SX3u8dGa4", "outputId": "16880a51-126c-4d43-b8e2-dda3f0398a20" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "AMZN\n", " Date Symbol Price Volume\n", "8 2015/5/1 AMZN 422.869995 3565800\n", "9 2015/5/4 AMZN 423.040009 2270400\n", "10 2015/5/5 AMZN 421.190002 2856400\n", "11 2015/5/6 AMZN 419.100006 2552500\n", "APPL\n", " Date Symbol Price Volume\n", "4 2015/5/1 APPL 120.220688 58512600\n", "5 2015/5/4 APPL 119.987633 50988300\n", "6 2015/5/5 APPL 117.283951 49271400\n", "7 2015/5/6 APPL 116.547424 72141000\n", "GOOG\n", " Date Symbol Price Volume\n", "0 2015/5/1 GOOG 537.900024 1768200\n", "1 2015/5/4 GOOG 540.780029 1308000\n", "2 2015/5/5 GOOG 530.799988 1383100\n", "3 2015/5/6 GOOG 524.219971 1567000\n" ] } ], "source": [ "stock_by_symbol = stock.groupby('Symbol')\n", "\n", "# what was returned is a GroupBy object\n", "# wrap it in a loop to have a peek at the resulting grouping\n", "for key, group in stock_by_symbol:\n", " print(key)\n", " print(group)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "S335A83LdGa6", "outputId": "3174bd76-8a29-4625-cebf-45d207043727" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "('2015/5/1', 'AMZN')\n", " Date Symbol Price Volume\n", "8 2015/5/1 AMZN 422.869995 3565800\n", "('2015/5/1', 'APPL')\n", " Date Symbol Price Volume\n", "4 2015/5/1 APPL 120.220688 58512600\n", "('2015/5/1', 'GOOG')\n", " Date Symbol Price Volume\n", "0 2015/5/1 GOOG 537.900024 1768200\n", "('2015/5/4', 'AMZN')\n", " Date Symbol Price 
Volume\n", "9 2015/5/4 AMZN 423.040009 2270400\n", "('2015/5/4', 'APPL')\n", " Date Symbol Price Volume\n", "5 2015/5/4 APPL 119.987633 50988300\n", "('2015/5/4', 'GOOG')\n", " Date Symbol Price Volume\n", "1 2015/5/4 GOOG 540.780029 1308000\n", "('2015/5/5', 'AMZN')\n", " Date Symbol Price Volume\n", "10 2015/5/5 AMZN 421.190002 2856400\n", "('2015/5/5', 'APPL')\n", " Date Symbol Price Volume\n", "6 2015/5/5 APPL 117.283951 49271400\n", "('2015/5/5', 'GOOG')\n", " Date Symbol Price Volume\n", "2 2015/5/5 GOOG 530.799988 1383100\n", "('2015/5/6', 'AMZN')\n", " Date Symbol Price Volume\n", "11 2015/5/6 AMZN 419.100006 2552500\n", "('2015/5/6', 'APPL')\n", " Date Symbol Price Volume\n", "7 2015/5/6 APPL 116.547424 72141000\n", "('2015/5/6', 'GOOG')\n", " Date Symbol Price Volume\n", "3 2015/5/6 GOOG 524.219971 1567000\n" ] } ], "source": [ "stock_by_date_symbol = stock.groupby(['Date', 'Symbol'])\n", "\n", "for key, group in stock_by_date_symbol:\n", " print(key)\n", " print(group)" ] }, { "cell_type": "markdown", "metadata": { "id": "WDwA2lHpdGbB" }, "source": [ "\n", "\n", "Pandas provides [many common aggregations](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) that can be applied to `GroupBy` objects and return a scalar per group in the apply/combine steps:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "z_mfX9WLcPRu", "outputId": "5460927b-8e65-41f9-e4a5-f0dd6957a7f3" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['Date',\n", " 'Price',\n", " 'Symbol',\n", " 'Volume',\n", " '_DataFrameGroupBy__examples_dataframe_doc',\n", " '__annotations__',\n", " '__class__',\n", " '__class_getitem__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattr__',\n", " '__getattribute__',\n", " '__getitem__',\n", " '__gt__',\n", " 
'__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__iter__',\n", " '__le__',\n", " '__len__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__orig_bases__',\n", " '__parameters__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__slots__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " '_accessors',\n", " '_agg_examples_doc',\n", " '_agg_general',\n", " '_agg_py_fallback',\n", " '_aggregate_frame',\n", " '_aggregate_with_numba',\n", " '_apply_filter',\n", " '_apply_to_column_groupbys',\n", " '_ascending_count',\n", " '_bool_agg',\n", " '_cache',\n", " '_choose_path',\n", " '_concat_objects',\n", " '_constructor',\n", " '_cumcount_array',\n", " '_cython_agg_general',\n", " '_cython_transform',\n", " '_define_paths',\n", " '_descending_count',\n", " '_dir_additions',\n", " '_dir_deletions',\n", " '_fill',\n", " '_get_cythonized_result',\n", " '_get_data_to_aggregate',\n", " '_get_index',\n", " '_get_indices',\n", " '_gotitem',\n", " '_hidden_attrs',\n", " '_indexed_output_to_ndframe',\n", " '_insert_inaxis_grouper',\n", " '_internal_names',\n", " '_internal_names_set',\n", " '_is_protocol',\n", " '_iterate_column_groupbys',\n", " '_iterate_slices',\n", " '_make_mask_from_int',\n", " '_make_mask_from_list',\n", " '_make_mask_from_positional_indexer',\n", " '_make_mask_from_slice',\n", " '_make_mask_from_tuple',\n", " '_mask_selected_obj',\n", " '_maybe_transpose_result',\n", " '_nth',\n", " '_numba_agg_general',\n", " '_numba_prep',\n", " '_obj_1d_constructor',\n", " '_obj_with_exclusions',\n", " '_op_via_apply',\n", " '_positional_selector',\n", " '_python_agg_general',\n", " '_python_apply_general',\n", " '_reindex_output',\n", " '_reset_cache',\n", " '_selected_obj',\n", " '_selection',\n", " '_selection_list',\n", " '_set_result_index_ordered',\n", " '_transform',\n", " '_transform_general',\n", " '_transform_with_numba',\n", " 
'_value_counts',\n", " '_wrap_agged_manager',\n", " '_wrap_aggregated_output',\n", " '_wrap_applied_output',\n", " '_wrap_applied_output_series',\n", " '_wrap_transform_fast_result',\n", " 'agg',\n", " 'aggregate',\n", " 'all',\n", " 'any',\n", " 'apply',\n", " 'bfill',\n", " 'boxplot',\n", " 'corr',\n", " 'corrwith',\n", " 'count',\n", " 'cov',\n", " 'cumcount',\n", " 'cummax',\n", " 'cummin',\n", " 'cumprod',\n", " 'cumsum',\n", " 'describe',\n", " 'diff',\n", " 'dtypes',\n", " 'ewm',\n", " 'expanding',\n", " 'ffill',\n", " 'fillna',\n", " 'filter',\n", " 'first',\n", " 'get_group',\n", " 'groups',\n", " 'head',\n", " 'hist',\n", " 'idxmax',\n", " 'idxmin',\n", " 'indices',\n", " 'last',\n", " 'max',\n", " 'mean',\n", " 'median',\n", " 'min',\n", " 'ndim',\n", " 'ngroup',\n", " 'ngroups',\n", " 'nth',\n", " 'nunique',\n", " 'ohlc',\n", " 'pct_change',\n", " 'pipe',\n", " 'plot',\n", " 'prod',\n", " 'quantile',\n", " 'rank',\n", " 'resample',\n", " 'rolling',\n", " 'sample',\n", " 'sem',\n", " 'shift',\n", " 'size',\n", " 'skew',\n", " 'std',\n", " 'sum',\n", " 'tail',\n", " 'take',\n", " 'transform',\n", " 'value_counts',\n", " 'var']" ] }, "metadata": {}, "execution_count": 109 } ], "source": [ "dir(stock_by_symbol)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "eCB8zLYgdGbD", "outputId": "d963268a-3340-42c2-9ef2-a2f7a83444df" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Price Volume\n", "Symbol \n", "AMZN 421.550003 2811275.0\n", "APPL 118.509924 57728325.0\n", "GOOG 533.425003 1506575.0" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 214.671775775214,\n \"min\": 118.509924,\n \"max\": 533.4250030000001,\n \"num_unique_values\": 3,\n \"samples\": [\n 421.550003,\n 118.509924,\n 533.4250030000001\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 32089639.54262861,\n \"min\": 1506575.0,\n \"max\": 57728325.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 2811275.0,\n 57728325.0,\n 1506575.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 110 } ], "source": [ "# can only apply to numeric columns\n", "stock_by_symbol.mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "mfaA9Uyn0cI5", "outputId": "ec274c92-0a2b-4c3d-ce4b-6f888a5221a9" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Date Price Volume\n", "Symbol \n", "AMZN 2015/5/1 419.100006 2270400\n", "APPL 2015/5/1 116.547424 49271400\n", "GOOG 2015/5/1 524.219971 1308000" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 211.65426899149122,\n \"min\": 116.547424,\n \"max\": 524.219971,\n \"num_unique_values\": 3,\n \"samples\": [\n 419.100006\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27418083,\n \"min\": 1308000,\n \"max\": 49271400,\n \"num_unique_values\": 3,\n \"samples\": [\n 2270400\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 47 } ], "source": [ "# min() also works on non-numeric columns (strings such as Date are compared lexicographically)\n", "stock_by_symbol.min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 174 }, "id": "o3iyQqHJ0nD0", "outputId": "8c11d8ae-faa1-4d42-f314-41b2eb8ff431" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Date Price Volume\n", "Symbol \n", "AMZN 2015/5/1 422.869995 3565800\n", "APPL 2015/5/1 120.220688 58512600\n", "GOOG 2015/5/1 537.900024 1768200" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 215.74851807939186,\n \"min\": 120.220688,\n \"max\": 537.900024,\n \"num_unique_values\": 3,\n \"samples\": [\n 422.869995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 32254997,\n \"min\": 1768200,\n \"max\": 58512600,\n \"num_unique_values\": 3,\n \"samples\": [\n 3565800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 48 } ], "source": [ "# take the first non-null value of each column within each group\n", "stock_by_symbol.first()" ] }, { "cell_type": "markdown", "metadata": { "id": "Nz6Nb0-GdGbH" }, "source": [ "To keep the group keys as a regular column instead of the index of the aggregated output, pass `as_index=False` to `groupby()` when creating the `GroupBy` object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "x5YRkPy2dGbH", "outputId": "260fd0f6-079f-4357-d92e-7177d5b5bc7a" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Symbol Price Volume\n", "0 AMZN 421.550003 2811275.0\n", "1 APPL 118.509924 57728325.0\n", "2 GOOG 533.425003 1506575.0" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 214.671775775214,\n \"min\": 118.509924,\n \"max\": 533.4250030000001,\n \"num_unique_values\": 3,\n \"samples\": [\n 421.550003,\n 118.509924,\n 533.4250030000001\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 32089639.54262861,\n \"min\": 1506575.0,\n \"max\": 57728325.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 2811275.0,\n 57728325.0,\n 1506575.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 49 } ], "source": [ "stock_by_symbol = stock.groupby('Symbol', as_index=False)\n", "stock_by_symbol.mean(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "pcMUIpx1s4BW" }, "source": [ "*In-Class Exercise:*\n", "\n", "What are the most popular names for US babies?\n", "\n", "The code below retrieves US baby names from 2004 to 2014. Apply what you have learned to find the 5 most popular names during this period, along with the count of each name. Hint: you can use the `.sort_values(by=\"Column\", ascending=False)` method to sort a DataFrame."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 423 }, "id": "2JIpEZY7tOgA", "outputId": "1cf8a561-c172-4335-df7e-c933ed674983" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Unnamed: 0 Id Name Year Gender State Count\n", "0 11349 11350 Emma 2004 F AK 62\n", "1 11350 11351 Madison 2004 F AK 48\n", "2 11351 11352 Hannah 2004 F AK 46\n", "3 11352 11353 Grace 2004 F AK 44\n", "4 11353 11354 Emily 2004 F AK 41\n", "... ... ... ... ... ... ... ...\n", "1016390 5647421 5647422 Seth 2014 M WY 5\n", "1016391 5647422 5647423 Spencer 2014 M WY 5\n", "1016392 5647423 5647424 Tyce 2014 M WY 5\n", "1016393 5647424 5647425 Victor 2014 M WY 5\n", "1016394 5647425 5647426 Waylon 2014 M WY 5\n", "\n", "[1016395 rows x 7 columns]" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "baby_names" } }, "metadata": {}, "execution_count": 3 } ], "source": [ "baby_names = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')\n", "baby_names" ] }, { "cell_type": "code", "source": [ "# write your code here\n", "# group all records that share the same name\n", "group_baby_names = baby_names.groupby('Name')" ], "metadata": { "id": "Xrp2SCEqLKlV" }, "execution_count": 4, "outputs": [] }, { "cell_type": "code", "source": [ "# total the counts per name, sort in descending order, and keep the top 5\n", "group_baby_names[['Count']].sum().sort_values(by=\"Count\", ascending=False).head(5)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 237 }, "id": "KaSgCPzQxn6A", "outputId": "514866e6-4e89-444e-f2d0-1a3c022ddf88" }, "execution_count": 5, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Count\n", "Name \n", "Jacob 242874\n", "Emma 214852\n", "Michael 214405\n", "Ethan 209277\n", "Isabella 204798" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"group_baby_names\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Emma\",\n \"Isabella\",\n \"Michael\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 14908,\n \"min\": 204798,\n \"max\": 242874,\n \"num_unique_values\": 5,\n \"samples\": [\n 214852,\n 204798,\n 214405\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "metadata": { "id": "eQAMrXLnRNhp" }, "source": [ "*Additional*: What are the 5 most popular baby names for each gender (male and female)?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "MIOHSZeOWKk4" }, "outputs": [], "source": [ "# write your code here\n", "# group by gender first, then by name\n", "group_gender_baby_names = baby_names.groupby(['Gender','Name'])" ] }, { "cell_type": "code", "source": [ "name_count_gender = group_gender_baby_names.sum(numeric_only=True)" ], "metadata": { "id": "w2fTmNoTyn_y" }, "execution_count": 7, "outputs": [] }, { "cell_type": "code", "source": [ "name_count_gender.loc['F'].sort_values(by=\"Count\", ascending=False).head(5)[['Count']]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 237 }, "id": "BH5bYXiEyxY1", "outputId": "3b4606c0-e456-4abf-a517-7854a8650d8a" }, "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Count\n", "Name \n", "Emma 214757\n", "Isabella 204742\n", "Sophia 191421\n", "Emily 190211\n", "Olivia 187962" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"name_count_gender\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Isabella\",\n \"Olivia\",\n \"Sophia\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11519,\n \"min\": 187962,\n \"max\": 214757,\n \"num_unique_values\": 5,\n \"samples\": [\n 204742,\n 187962,\n 191421\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "code", "source": [ "name_count_gender.loc['M'].sort_values(by=\"Count\", ascending=False).head(5)[['Count']]" ], "metadata": { "id": "OuRPQAHhzToK", "colab": { "base_uri": "https://localhost:8080/", "height": 237 }, "outputId": "5408c993-afb9-4841-cf44-562b2bcfd146" }, "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Count\n", "Name \n", "Jacob 242706\n", "Michael 214228\n", "Ethan 209153\n", "William 197796\n", "Joshua 191444" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"name_count_gender\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Michael\",\n \"Joshua\",\n \"Ethan\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 19848,\n \"min\": 191444,\n \"max\": 242706,\n \"num_unique_values\": 5,\n \"samples\": [\n 214228,\n 191444,\n 209153\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 11 } ] } ], "metadata": { "colab": { "provenance": [] }, "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }