{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "cr_lUENJdGW1"
},
"source": [
"\n",
"\n",
"\n",
"[Pandas](http://pandas.pydata.org/) is a Python library that provides data structures and functions for fast, easy, and expressive manipulation of *structured data*.\n",
"\n",
"\n",
"
\n",
"\n",
"- It provides two main data structures: the `Series` which holds a **1-dimensional sequence** of ***homogeneous*** values, and the `DataFrame`, which holds a ***tabular***, ***heterogeneous*** dataset.\n",
"\n",
"- It also contains a large number of functions and methods to manipulate and summarize `Series` and `DataFrame` objects.\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "yU_LgG6WdGW2"
},
"outputs": [],
"source": [
"import pandas as pd # abbreviated as pd conventionally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "Uv3PBPdW8mU6",
"outputId": "bb76b304-b0df-418c-d9e1-ca819db08816"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'2.0.3'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 2
}
],
"source": [
"pd.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1pUM0xM48RWJ"
},
"source": [
"# 1 `Series` and `DataFrame`\n",
"\n",
"A [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) represents a ***2-dimensional***, ***tabular*** data structure containing an ***ordered*** collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.).\n",
"\n",
"Previously in web scraping, we have seen:\n",
"```python\n",
"lie_df = pd.DataFrame({'date': date_list, 'lie': lie_list, 'explanation': explanation_list, 'url': url_list})\n",
"```\n",
"\n",
"\n",
"A `DataFrame` can be thought of as a specialization of a Python dictionary. It maps names (i.e., column names or indcies) to a sequence of data series that share the same set of labels (i.e., row names or indices).\n",
"\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"Let's build up the above `DataFrame` from scratch based on this component view (column by column):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Tu_Te5Ga8HXb",
"outputId": "fc65c8bb-a16d-4d87-f6f7-b27323cbfe42"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 Wan Chai\n",
"1 North\n",
"2 Sai Kung\n",
"3 Sha Tin\n",
"dtype: object"
]
},
"metadata": {},
"execution_count": 3
}
],
"source": [
"# https://pandas.pydata.org/docs/reference/api/pandas.Series.html\n",
"\n",
"# a Series can be thought of as a 1-dimensional array with attached labels\n",
"# a set of default indices, consisting of the integers 0 through n-1, are automatically attached\n",
"district_name = pd.Series(['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])\n",
"district_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "afo4Oh1ziCzs",
"outputId": "1dd3b6c3-0a3f-4e1f-a08f-2313582bdbe8"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'], dtype=object)"
]
},
"metadata": {},
"execution_count": 4
}
],
"source": [
"# Return Series as array\n",
"district_name.values"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B8OMPSkv2Nu6"
},
"source": [
"`Array` is similar to `List`, but it requires all elements to be of the same data type. This characteristic is beneficial for certain operations, especially those that are mathematically intensive, as it allows for more efficient data processing."
]
},
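{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the point above (the names `values`, `arr`, and `mixed` are illustrative and not part of the original notebook), the snippet below contrasts a Python list with a NumPy array: arithmetic on the array is applied element by element, and mixing integers with a float yields one common `dtype`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"values = [1, 2, 3]             # a plain Python list\n",
"arr = np.array(values)         # a NumPy array with one shared integer dtype\n",
"\n",
"repeated = values * 2          # list repetition: the list is concatenated with itself\n",
"doubled = arr * 2              # element-wise multiplication\n",
"\n",
"mixed = np.array([1, 2.5, 3])  # the integers are upcast so that every element is a float\n",
"print(repeated, doubled, mixed.dtype)\n",
"```\n",
"\n",
"Note that the `district_name.values` array shown earlier has `dtype=object`, which is how NumPy stores arbitrary Python objects such as strings; the efficiency gains apply mainly to numeric dtypes."
]
},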
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Im-OAkGQiJnc",
"outputId": "2b34d92b-3761-49ec-bc9a-a60080774b9d"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"RangeIndex(start=0, stop=4, step=1)"
]
},
"metadata": {},
"execution_count": 5
}
],
"source": [
"# Return the index of the Series.\n",
"district_name.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hp7S6BsD279f",
"outputId": "3aad52f1-afca-4817-df33-c77c5764d294"
},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(district_name.index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "65HQ7NA-8HXc",
"outputId": "0520f329-f831-4294-aa21-dfdb58a0971d"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 150900\n",
"1 310800\n",
"2 448600\n",
"3 648200\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"district_population = pd.Series([150900, 310800, 448600, 648200])\n",
"district_population"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "C4CTdE0Z8HXc",
"outputId": "87cb7bec-8851-434d-8c84-d49459a74c31"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 9.83\n",
"1 136.61\n",
"2 129.65\n",
"3 68.71\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 7
}
],
"source": [
"district_area = pd.Series([9.83, 136.61, 129.65, 68.71])\n",
"district_area"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "B7w5X4KR8HXd",
"outputId": "cbcd4762-d979-4b6a-f689-6f880a43a70b"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" District Population Area\n",
"0 Wan Chai 150900 9.83\n",
"1 North 310800 136.61\n",
"2 Sai Kung 448600 129.65\n",
"3 Sha Tin 648200 68.71"
],
"text/html": [
"\n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" District | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Wan Chai | \n",
" 150900 | \n",
" 9.83 | \n",
"
\n",
" \n",
" 1 | \n",
" North | \n",
" 310800 | \n",
" 136.61 | \n",
"
\n",
" \n",
" 2 | \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
"
\n",
" \n",
" 3 | \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "HK_district1",
"summary": "{\n \"name\": \"HK_district1\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"District\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"North\",\n \"Sha Tin\",\n \"Wan Chai\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 310800,\n 648200,\n 150900\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 136.61,\n 68.71,\n 9.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 8
}
],
"source": [
"# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html\n",
"\n",
"HK_district1 = pd.DataFrame({'District': district_name,\n",
" 'Population': district_population,\n",
" 'Area': district_area})\n",
"HK_district1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J5p1mL3o8HXe"
},
"source": [
"A `Series` can also be created with user supplied index.\n",
"\n",
"Apart from making data more readable, the explicit index definition gives the `Series` object additional capabilities such as label-based selection and operation alignment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qxBb1vHP8HXe",
"outputId": "69f9e3a0-fbf4-442d-bd2b-400b79b7ea7f"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Sai Kung 448600\n",
"Sha Tin 648200\n",
"Wan Chai 150900\n",
"North 310800\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 9
}
],
"source": [
"district_population2 = pd.Series([448600, 648200, 150900, 310800],\n",
" index=['Sai Kung', 'Sha Tin', 'Wan Chai', 'North'])\n",
"district_population2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vSYLyA4_8HXf",
"outputId": "9dea60e2-b95d-4179-9af4-17c8900e59b3"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Wan Chai 9.83\n",
"North 136.61\n",
"Sai Kung 129.65\n",
"Sha Tin 68.71\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 10
}
],
"source": [
"district_area2 = pd.Series([9.83, 136.61, 129.65, 68.71],\n",
" index=['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])\n",
"district_area2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "oDgT-R7G8HXg",
"outputId": "37fbac32-2599-4768-b69d-8a2df9d94064"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area\n",
"North 310800 136.61\n",
"Sai Kung 448600 129.65\n",
"Sha Tin 648200 68.71\n",
"Wan Chai 150900 9.83"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 136.61 | \n",
"
\n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 150900 | \n",
" 9.83 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "HK_district2",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 448600,\n 150900,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 129.65,\n 9.83,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 11
}
],
"source": [
"HK_district2 = pd.DataFrame({'Population': district_population2, 'Area': district_area2})\n",
"HK_district2"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K_PNxiU08HXg"
},
"source": [
"The data from the two `Series` are ***aligned via index labels*** (also sorted in the result).\n",
"\n"
]
},
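{
"cell_type": "markdown",
"metadata": {},
"source": [
"Label alignment also applies to arithmetic between `Series`. The following sketch uses two small, hypothetical `Series` (`population` and `area`, which deliberately list their labels in different orders and do not cover the same districts) to illustrate this: values are paired by label rather than by position, and a label missing from either operand produces `NaN`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"population = pd.Series([448600, 150900], index=['Sai Kung', 'Wan Chai'])\n",
"area = pd.Series([9.83, 129.65, 68.71], index=['Wan Chai', 'Sai Kung', 'Sha Tin'])\n",
"\n",
"# the division pairs values by index label, not by position\n",
"density = population / area\n",
"print(density)\n",
"```\n",
"\n",
"Here `Sha Tin` has an area but no population, so its density is `NaN` rather than an error."
]
},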
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0w_xGMOuJfpT",
"outputId": "ebd5a526-39c0-4863-83e1-c670dff7aa02"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Index(['District', 'Population', 'Area'], dtype='object')"
]
},
"metadata": {},
"execution_count": 12
}
],
"source": [
"# Return the column labels of the DataFrame\n",
"HK_district1.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xqTcdEWs8HXh"
},
"source": [
"Individual columns of a `DataFrame` can be accessed with dictionary-style indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0wZPsdmz8HXh",
"outputId": "b64b3f74-b56c-4528-db8f-868a11373a37"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 150900\n",
"1 310800\n",
"2 448600\n",
"3 648200\n",
"Name: Population, dtype: int64"
]
},
"metadata": {},
"execution_count": 13
}
],
"source": [
"HK_district1['Population']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zWEFJPSg4AqL"
},
"source": [
"They can also be accessed using the attribute reference notation as if they are the attributes of a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "meGHKttf8HXh",
"outputId": "de99d8a3-2d8f-4ee0-b169-07ccce39c930"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"North 136.61\n",
"Sai Kung 129.65\n",
"Sha Tin 68.71\n",
"Wan Chai 9.83\n",
"Name: Area, dtype: float64"
]
},
"metadata": {},
"execution_count": 14
}
],
"source": [
"HK_district2.Area"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1rB6e1_wVhux",
"outputId": "070555a2-cab7-42d8-883f-11979108cbd3"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"North 310800\n",
"Sai Kung 448600\n",
"Sha Tin 648200\n",
"Wan Chai 150900\n",
"Name: Population, dtype: int64"
]
},
"metadata": {},
"execution_count": 15
}
],
"source": [
"HK_district2.Population"
]
},
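{
"cell_type": "markdown",
"metadata": {},
"source": [
"Attribute-style access is a convenience with limits. As a hedged illustration (the `DataFrame` named `demo` and its column names are made up for this example), a column whose name contains a space, or whose name collides with an existing `DataFrame` method, can only be reached reliably with bracket indexing.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"demo = pd.DataFrame({'Land Area': [9.83, 136.61], 'count': [1, 2]})\n",
"\n",
"land_area = demo['Land Area']  # brackets work for any column name\n",
"count_column = demo['count']   # the column of data\n",
"count_method = demo.count      # the DataFrame.count method, not the column\n",
"print(callable(count_method))\n",
"```\n",
"\n",
"For this reason, bracket indexing is the safer default in reusable code."
]
},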
{
"cell_type": "markdown",
"metadata": {
"id": "t_RelPLOVQJc"
},
"source": [
"Because pandas is built on top of NumPy, `Series` and `DataFrame` objects support **vectorized operations**. Vectorized operations are a powerful feature of Python that allow you to apply a function or an operation to multiple elements of an array or a dataframe at once, instead of using loops. This can save time, improve your code readability, and reduce your memory usage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "msdHWhpMACwm",
"outputId": "fc324144-caaa-4e03-d12f-32fa1d508796"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area Density\n",
"North 310800 136.61 2275.089671\n",
"Sai Kung 448600 129.65 3460.084844\n",
"Sha Tin 648200 68.71 9433.852423\n",
"Wan Chai 150900 9.83 15350.966429"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 136.61 | \n",
" 2275.089671 | \n",
"
\n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 150900 | \n",
" 9.83 | \n",
" 15350.966429 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "HK_district2",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 448600,\n 150900,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 129.65,\n 9.83,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6025.790769823514,\n \"min\": 2275.0896713271354,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 4,\n \"samples\": [\n 3460.0848438102585,\n 15350.966429298067,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 17
}
],
"source": [
"# this assignment form of indexing creates a new column\n",
"HK_district2['Density'] = HK_district2.Population / HK_district2.Area\n",
"HK_district2"
]
},
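{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the contrast with explicit loops concrete, the sketch below computes the same density twice on two small `Series` that are rebuilt here so the snippet is self-contained (the names `loop_density` and `vector_density` are illustrative): once with a Python loop and once with a single vectorized division.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"population = pd.Series([150900, 310800], index=['Wan Chai', 'North'])\n",
"area = pd.Series([9.83, 136.61], index=['Wan Chai', 'North'])\n",
"\n",
"# explicit loop: one division per label\n",
"loop_density = pd.Series({name: population[name] / area[name] for name in population.index})\n",
"\n",
"# vectorized: one expression over the whole Series\n",
"vector_density = population / area\n",
"\n",
"# largest absolute difference between the two results\n",
"print((vector_density - loop_density).abs().max())\n",
"```\n",
"\n",
"Both approaches give the same numbers; the vectorized form is shorter and delegates the looping to compiled code."
]
},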
{
"cell_type": "markdown",
"metadata": {
"id": "XWj8QYIEo3FU"
},
"source": [
"There are many ways to create a DataFrame. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html\n",
"\n",
"Another usual way of creating a DataFrame is by using `pd.DataFrame(data, columns=[column_names],index=[row_names])` explicitly.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "h-eiSsbHpH08",
"outputId": "55260d3a-e87b-4a3e-c87d-7752ea272105"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"summary": "{\n \"name\": \"df\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"a\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 3.0,\n \"max\": 4.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 4.0,\n 3.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"b\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 7.0,\n \"max\": 8.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 7.0,\n 8.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
"type": "dataframe",
"variable_name": "df"
},
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" b | \n",
"
\n",
" \n",
" \n",
" \n",
" first | \n",
" 3.0 | \n",
" 8.0 | \n",
"
\n",
" \n",
" second | \n",
" 4.0 | \n",
" 7.0 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"text/plain": [
" a b\n",
"first 3.0 8.0\n",
"second 4.0 7.0"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(data=[[3.0, 8.0], [4.0, 7.0]], columns=['a', 'b'], index=['first','second'])\n",
"df"
]
},
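{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one further illustration of the many construction routes (the names `from_lists` and `from_records` are assumptions made for this sketch), the same table can be built from a list of dictionaries, in which case the dictionary keys become the column labels.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# rows given as nested lists, with column labels supplied separately\n",
"from_lists = pd.DataFrame(data=[[3.0, 8.0], [4.0, 7.0]], columns=['a', 'b'], index=['first', 'second'])\n",
"\n",
"# each dictionary is one row; its keys become the column labels\n",
"from_records = pd.DataFrame([{'a': 3.0, 'b': 8.0}, {'a': 4.0, 'b': 7.0}], index=['first', 'second'])\n",
"\n",
"print(from_lists.equals(from_records))\n",
"```\n",
"\n",
"The list-of-dictionaries form is convenient when each record arrives as a separate object, for example one scraped item at a time."
]
},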
{
"cell_type": "markdown",
"metadata": {
"id": "n9aeaNrcdGYS"
},
"source": [
"\n",
"---\n",
"\n",
"
\n",
"\n",
"# 2 Data Selection in `DataFrame`s\n",
"\n",
"\n",
"`DataFrame` support both ***label-based*** indexing and ***location-based*** indexing.\n",
"\n",
"Pandas provids two indexer attributes that explicitly expose which indexing scheme to apply:\n",
"\n",
"- The `loc` attribute uses ***label-based*** indexing:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BA4iaMZrdGYc",
"outputId": "bd8ee1c3-bdc5-48bd-c73a-be507095c2a7"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"129.65"
]
},
"metadata": {},
"execution_count": 18
}
],
"source": [
"HK_district2.loc['Sai Kung', 'Area']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5Mc4hyUU7qzY"
},
"source": [
"- The `loc` attribute can be used for slicing based on labels. It can handle slices, single labels, and lists of labels. Both the start and the stop of the slice are inclusive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "DExCstxjdGYf",
"outputId": "95aa7d0c-03ba-46a6-d470-82e9bdc27ab2"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area\n",
"Sai Kung 448600 129.65\n",
"Sha Tin 648200 68.71\n",
"Wan Chai 150900 9.83"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 150900 | \n",
" 9.83 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 250257,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 3,\n \"samples\": [\n 448600,\n 648200,\n 150900\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.912951298807954,\n \"min\": 9.83,\n \"max\": 129.65,\n \"num_unique_values\": 3,\n \"samples\": [\n 129.65,\n 68.71,\n 9.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 19
}
],
"source": [
"# slicing selects contiguous rows and columns\n",
"# but the last label in inclusive this time\n",
"\n",
"HK_district2.loc['Sai Kung':'Wan Chai', :'Area']"
]
},
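{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inclusive stop of a label slice is easy to overlook, so here is a small sketch on a hypothetical `Series` `s` (not part of the original data) that places a `.loc` label slice next to an ordinary Python list slice over the same labels.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])\n",
"\n",
"label_slice = s.loc['b':'c']    # label slice: both 'b' and 'c' are included\n",
"positions = list(s.index)[1:2]  # ordinary Python slice: the stop position is excluded\n",
"print(list(label_slice.index), positions)\n",
"```\n",
"\n",
"The label slice returns two rows, whereas the positional slice over the same range returns only one element."
]
},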
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "stvLLsCBdGYg",
"outputId": "d450f23e-ffdb-4883-e656-59f29363cf47"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Density Population\n",
"Sai Kung 3460.084844 448600\n",
"Wan Chai 15350.966429 150900"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Density | \n",
" Population | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 3460.084844 | \n",
" 448600 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 15350.966429 | \n",
" 150900 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \" ['Density', 'Population']]\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 8408.123003384675,\n \"min\": 3460.0848438102585,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 2,\n \"samples\": [\n 15350.966429298067,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210505,\n \"min\": 150900,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 150900,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 20
}
],
"source": [
"# using indexing lists (or tuples) can select non-contiguous rows and columns\n",
"# can also present them in a different order, e.g., make Density precede Population\n",
"\n",
"HK_district2.loc[['Sai Kung', 'Wan Chai'],\n",
" ['Density', 'Population']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p3bklxjn_xY6"
},
"source": [
"Boolean indexing selects items that satisfy certain criteria; important for data filtering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "Q9P1vA4h_0Wt",
"outputId": "9ebdd628-fe53-4bcc-d8b2-dd426fad8704"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Density\n",
"North 310800 2275.089671\n",
"Sha Tin 648200 9433.852423"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 2275.089671 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 238577,\n \"min\": 310800,\n \"max\": 648200,\n \"num_unique_values\": 2,\n \"samples\": [\n 648200,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5062.009686774815,\n \"min\": 2275.0896713271354,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 2,\n \"samples\": [\n 9433.852423228062,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 21
}
],
"source": [
"import numpy as np\n",
"HK_district2.loc[np.array([True, False, True, False]), ['Population', 'Density']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "g3gB6wGvdGYk",
"outputId": "77bd1fe1-30cf-4d9b-9273-e3e4c310cd52"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Density\n",
"North 310800 2275.089671\n",
"Sai Kung 448600 3460.084844"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 2275.089671 | \n",
"
\n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97439,\n \"min\": 310800,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 448600,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 837.9181221361389,\n \"min\": 2275.0896713271354,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 2,\n \"samples\": [\n 3460.0848438102585,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 23
}
],
"source": [
"# differet types of indexing (and slicing) can be mixedly used\n",
"\n",
"HK_district2.loc[HK_district2.Area > 100, ['Population', 'Density']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2_ZI-qRWXyCS",
"outputId": "850b5fc6-0dd6-43ab-efc9-0414a4a3229c"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"North True\n",
"Sai Kung True\n",
"Sha Tin False\n",
"Wan Chai False\n",
"Name: Area, dtype: bool"
]
},
"metadata": {},
"execution_count": 22
}
],
"source": [
"HK_district2.Area > 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 81
},
"id": "dYhhQnjf4AKP",
"outputId": "5c1d0050-ee59-4d51-cdef-3d46283773c6"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"summary": "{\n \"name\": \" ['Population', 'Density']]\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 648200,\n \"max\": 648200,\n \"num_unique_values\": 1,\n \"samples\": [\n 648200\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 9433.852423228062,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 1,\n \"samples\": [\n 9433.852423228062\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}",
"type": "dataframe"
},
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"text/plain": [
" Population Density\n",
"Sha Tin 648200 9433.852423"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Boolean operators are ~, &, and | are used for selection\n",
"\n",
"HK_district2.loc[~(HK_district2.Area > 100) & (HK_district2.Population > 200000),\n",
" ['Population', 'Density']]"
]
},
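{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two details of the expression above deserve a closer look: each comparison is wrapped in parentheses, and the Python keywords `and`, `or`, and `not` are not used. The sketch below rebuilds the district table under the assumed name `districts` so that it is self-contained, and shows that the keyword form raises a `ValueError` because a whole `Series` has no single truth value.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"districts = pd.DataFrame({'Population': [310800, 448600, 648200, 150900],\n",
"                          'Area': [136.61, 129.65, 68.71, 9.83]},\n",
"                         index=['North', 'Sai Kung', 'Sha Tin', 'Wan Chai'])\n",
"\n",
"# the parentheses are needed because & binds more tightly than >\n",
"mask = ~(districts.Area > 100) & (districts.Population > 200000)\n",
"selected = list(districts[mask].index)\n",
"\n",
"# the keyword `and` asks for a single truth value, which a Series cannot provide\n",
"try:\n",
"    (districts.Area > 100) and (districts.Population > 200000)\n",
"    keyword_failed = False\n",
"except ValueError:\n",
"    keyword_failed = True\n",
"\n",
"print(selected, keyword_failed)\n",
"```\n",
"\n",
"The operators `~`, `&`, and `|` work element by element and return a Boolean `Series` that can be passed straight to `.loc`."
]
},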
{
"cell_type": "markdown",
"metadata": {
"id": "MShWN8sU5Iei"
},
"source": [
"Pandas also provide a handy helper function that allows us to query data with less verbose query strings: `query()` method. It is a powerful tool for filtering `DataFrame` rows using a concise and readable expression syntax.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "jzylxSwi5ZS9",
"outputId": "c0aea66c-db32-48fc-b443-7b696b929640"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area Density\n",
"Sha Tin 648200 68.71 9433.852423"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 648200,\n \"max\": 648200,\n \"num_unique_values\": 1,\n \"samples\": [\n 648200\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 68.71,\n \"max\": 68.71,\n \"num_unique_values\": 1,\n \"samples\": [\n 68.71\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 9433.852423228062,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 1,\n \"samples\": [\n 9433.852423228062\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 24
}
],
"source": [
"HK_district2.query('~ (Area > 100) & (Population > 200000)')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "7S3OQroFBBko",
"outputId": "728ddb31-44a9-42cf-97da-a975c7b685cb"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area Density\n",
"Sai Kung 448600 129.65 3460.084844"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 448600,\n \"max\": 448600,\n \"num_unique_values\": 1,\n \"samples\": [\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 129.65,\n \"max\": 129.65,\n \"num_unique_values\": 1,\n \"samples\": [\n 129.65\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 3460.0848438102585,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 1,\n \"samples\": [\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 25
}
],
"source": [
"HK_district2.query('index == \"Sai Kung\"')"
]
},
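{
"cell_type": "markdown",
"metadata": {},
"source": [
"Query strings can also refer to ordinary Python variables by prefixing their names with `@`. The sketch below assumes a rebuilt copy of the district table named `districts` and a hypothetical threshold variable `min_area`; neither name comes from the original notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"districts = pd.DataFrame({'Population': [310800, 448600, 648200, 150900],\n",
"                          'Area': [136.61, 129.65, 68.71, 9.83]},\n",
"                         index=['North', 'Sai Kung', 'Sha Tin', 'Wan Chai'])\n",
"\n",
"min_area = 100  # hypothetical threshold kept in a Python variable\n",
"\n",
"# @min_area is replaced by the value of the local variable min_area\n",
"large = districts.query('Area > @min_area')\n",
"print(list(large.index))\n",
"```\n",
"\n",
"This keeps the threshold out of the query string, which is helpful when the value is computed elsewhere in the program."
]
},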
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "uo9WD616KiVw",
"outputId": "cb8072a2-7914-4d63-d41e-22c725cd38c8"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area Density\n",
"Sai Kung 448600 129.65 3460.084844\n",
"Sha Tin 648200 68.71 9433.852423"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 141138,\n \"min\": 448600,\n \"max\": 648200,\n \"num_unique_values\": 2,\n \"samples\": [\n 648200,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.09108724550821,\n \"min\": 68.71,\n \"max\": 129.65,\n \"num_unique_values\": 2,\n \"samples\": [\n 68.71,\n 129.65\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4224.091564638677,\n \"min\": 3460.0848438102585,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 2,\n \"samples\": [\n 9433.852423228062,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 27
}
],
"source": [
"# Can take a 1-argument function. The x passed to the lambda is the DataFrame being sliced.\n",
"\n",
"HK_district2.loc[lambda x: [i[0]=='S' for i in x.index], :]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NMrEnfVBBh1c",
"outputId": "bef02d96-02bb-45c0-ac39-07c065f0ffea"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Index(['North', 'Sai Kung', 'Sha Tin', 'Wan Chai'], dtype='object')"
]
},
"metadata": {},
"execution_count": 26
}
],
"source": [
"HK_district2.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9sR1oXZNCZxZ"
},
"source": [
"If the second argument (column labels) is omitted, `.loc` will return all columns for the specified rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "EpR31NUiCVIJ",
"outputId": "5c368e77-16a7-488f-cf04-ca70a85d0408"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area Density\n",
"Sai Kung 448600 129.65 3460.084844\n",
"Sha Tin 648200 68.71 9433.852423"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 141138,\n \"min\": 448600,\n \"max\": 648200,\n \"num_unique_values\": 2,\n \"samples\": [\n 648200,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 43.09108724550821,\n \"min\": 68.71,\n \"max\": 129.65,\n \"num_unique_values\": 2,\n \"samples\": [\n 68.71,\n 129.65\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4224.091564638677,\n \"min\": 3460.0848438102585,\n \"max\": 9433.852423228062,\n \"num_unique_values\": 2,\n \"samples\": [\n 9433.852423228062,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 28
}
],
"source": [
"HK_district2.loc[lambda x: [i[0]=='S' for i in x.index]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K76qH-PcdGYo"
},
"source": [
"- The `iloc` attribute uses Python-style ***location-based*** indexing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "XKhgaIyNY9th",
"outputId": "03735dfe-5d16-4420-fd39-16e58290e8da"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area Density\n",
"North 310800 136.61 2275.089671\n",
"Sai Kung 448600 129.65 3460.084844\n",
"Sha Tin 648200 68.71 9433.852423\n",
"Wan Chai 150900 9.83 15350.966429"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 136.61 | \n",
" 2275.089671 | \n",
"
\n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
" 9433.852423 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 150900 | \n",
" 9.83 | \n",
" 15350.966429 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "HK_district2",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210983,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 4,\n \"samples\": [\n 448600,\n 150900,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.35022493638925,\n \"min\": 9.83,\n \"max\": 136.61,\n \"num_unique_values\": 4,\n \"samples\": [\n 129.65,\n 9.83,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 6025.790769823514,\n \"min\": 2275.0896713271354,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 4,\n \"samples\": [\n 3460.0848438102585,\n 15350.966429298067,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 29
}
],
"source": [
"HK_district2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iI2cWX6xdGYp",
"outputId": "328bea5b-fe28-4de2-c0b7-4eeb7cd0139b"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"129.65"
]
},
"metadata": {},
"execution_count": 30
}
],
"source": [
"# 0-based indexing; starting from zero\n",
"HK_district2.iloc[1, 1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "hDbkTrEqZISx",
"outputId": "38b90700-cc1b-4e21-d9e5-feeae05b61cf"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Area\n",
"Sai Kung 448600 129.65\n",
"Sha Tin 648200 68.71\n",
"Wan Chai 150900 9.83"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 129.65 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 648200 | \n",
" 68.71 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 150900 | \n",
" 9.83 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 250257,\n \"min\": 150900,\n \"max\": 648200,\n \"num_unique_values\": 3,\n \"samples\": [\n 448600,\n 648200,\n 150900\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 59.912951298807954,\n \"min\": 9.83,\n \"max\": 129.65,\n \"num_unique_values\": 3,\n \"samples\": [\n 129.65,\n 68.71,\n 9.83\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 31
}
],
"source": [
"# the last index is exclusive as with regular Python slicing\n",
"HK_district2.iloc[1:4, :2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "ymCsTpxxaD_l",
"outputId": "4f87e9ed-06df-4de8-b5cb-1f82a5203b38"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Density Population\n",
"Sai Kung 3460.084844 448600\n",
"Wan Chai 15350.966429 150900"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Density | \n",
" Population | \n",
"
\n",
" \n",
" \n",
" \n",
" Sai Kung | \n",
" 3460.084844 | \n",
" 448600 | \n",
"
\n",
" \n",
" Wan Chai | \n",
" 15350.966429 | \n",
" 150900 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 8408.123003384675,\n \"min\": 3460.0848438102585,\n \"max\": 15350.966429298067,\n \"num_unique_values\": 2,\n \"samples\": [\n 15350.966429298067,\n 3460.0848438102585\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 210505,\n \"min\": 150900,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 150900,\n 448600\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 32
}
],
"source": [
"# select non-contiguous rows and columns\n",
"HK_district2.iloc[[1, 3], [2, 0]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PpWB3RFKEE0W"
},
"source": [
"The `.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "wuPmilQLdGYy",
"outputId": "c7398335-05e0-46e1-e226-46d61252bde0",
"scrolled": true
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Density\n",
"North 310800 2275.089671\n",
"Sai Kung 448600 3460.084844"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 2275.089671 | \n",
"
\n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97439,\n \"min\": 310800,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 448600,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 837.9181221361389,\n \"min\": 2275.0896713271354,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 2,\n \"samples\": [\n 3460.0848438102585,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 34
}
],
"source": [
"# what HK_district2.Area returns is a Series\n",
"# iloc can only take a NumPy array, which can be accessed via .values\n",
"HK_district2.iloc[(HK_district2.Area > 100).values, [0, 2]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Gr7Yk_SHNh0h",
"outputId": "65c0428b-0cbd-47fb-c9cc-056fba605057"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([ True, True, False, False])"
]
},
"metadata": {},
"execution_count": 33
}
],
"source": [
"(HK_district2.Area > 100).values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "WmyrwCr5Nmaj",
"outputId": "5c7eb563-c70c-453b-8ade-2d52b028a6e7"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population Density\n",
"North 310800 2275.089671\n",
"Sai Kung 448600 3460.084844"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
" Density | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
" 2275.089671 | \n",
"
\n",
" \n",
" Sai Kung | \n",
" 448600 | \n",
" 3460.084844 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 97439,\n \"min\": 310800,\n \"max\": 448600,\n \"num_unique_values\": 2,\n \"samples\": [\n 448600,\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Density\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 837.9181221361389,\n \"min\": 2275.0896713271354,\n \"max\": 3460.0848438102585,\n \"num_unique_values\": 2,\n \"samples\": [\n 3460.0848438102585,\n 2275.0896713271354\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 35
}
],
"source": [
"HK_district2.iloc[(HK_district2.Area > 100).values, [True, False, True]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "PyjoPDkNIkZj",
"outputId": "d41fa286-2138-4993-9711-bd421e8d957c"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Area\n",
"North 136.61\n",
"Sha Tin 68.71"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Area | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 136.61 | \n",
"
\n",
" \n",
" Sha Tin | \n",
" 68.71 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \" lambda x: [i[0]=='A' for i in x\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"Area\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 48.01255044256659,\n \"min\": 68.71,\n \"max\": 136.61,\n \"num_unique_values\": 2,\n \"samples\": [\n 68.71,\n 136.61\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 36
}
],
"source": [
"# Can take a 1-argument function. The x passed to the lambda is the DataFrame being sliced.\n",
"\n",
"HK_district2.iloc[lambda x: [i for i in range(len(x)) if i % 2 == 0],\n",
" lambda x: [i[0]=='A' for i in x.columns]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-Z5fakGdEWN6",
"outputId": "9b72ed50-3ce6-473b-9643-1a2bae31369e"
},
"outputs": [
{
"data": {
"text/plain": [
"range(0, 4)"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"range(len(HK_district2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "e0p_b4FFEiaG",
"outputId": "1fd23d16-d543-4eb7-aa6f-6c1e6843d597"
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Population', 'Area', 'Density'], dtype='object')"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"HK_district2.columns"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2TQdkZzoqI8Y"
},
"source": [
"*Exercise*:\n",
"Can you select district(s) whose name is shorter than 6 characters and show their `Population`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jHOnVubqsDTn",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"outputId": "f8a50661-4fac-48e5-8e3a-c4da479b8964"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Population\n",
"North 310800"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Population | \n",
"
\n",
" \n",
" \n",
" \n",
" North | \n",
" 310800 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"HK_district2\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"Population\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 310800,\n \"max\": 310800,\n \"num_unique_values\": 1,\n \"samples\": [\n 310800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 37
}
],
"source": [
"# write your code here\n",
"\n",
"HK_district2.iloc[lambda x: [len(i)<6 for i in x.index],[0]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gN_jwMl-dGZy"
},
"source": [
"---\n",
"\n",
"
\n",
"\n",
"# 3 Importing and Exporting Data\n",
"\n",
"Pandas features a number of [functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for reading tabular data as a `DataFrame` object. Among them, [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is likely the one we'll use the most:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "UTYNJD03Yb0b",
"outputId": "4128a707-1d22-4c27-a37c-836760f98730"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Date GOOG APPL AMZN\n",
"0 2015/5/1 537.900024 120.220688 422.869995\n",
"1 2015/5/4 540.780029 119.987633 423.040009\n",
"2 2015/5/5 530.799988 117.283951 421.190002\n",
"3 2015/5/6 524.219971 116.547424 419.100006"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" GOOG | \n",
" APPL | \n",
" AMZN | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2015/5/1 | \n",
" 537.900024 | \n",
" 120.220688 | \n",
" 422.869995 | \n",
"
\n",
" \n",
" 1 | \n",
" 2015/5/4 | \n",
" 540.780029 | \n",
" 119.987633 | \n",
" 423.040009 | \n",
"
\n",
" \n",
" 2 | \n",
" 2015/5/5 | \n",
" 530.799988 | \n",
" 117.283951 | \n",
" 421.190002 | \n",
"
\n",
" \n",
" 3 | \n",
" 2015/5/6 | \n",
" 524.219971 | \n",
" 116.547424 | \n",
" 419.100006 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "sp",
"summary": "{\n \"name\": \"sp\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"2015/5/4\",\n \"2015/5/6\",\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"GOOG\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 7.432934330450454,\n \"min\": 524.219971,\n \"max\": 540.780029,\n \"num_unique_values\": 4,\n \"samples\": [\n 540.780029,\n 524.219971,\n 537.900024\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.8676860370617223,\n \"min\": 116.547424,\n \"max\": 120.220688,\n \"num_unique_values\": 4,\n \"samples\": [\n 119.987633,\n 116.547424,\n 120.220688\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.834355725235249,\n \"min\": 419.100006,\n \"max\": 423.040009,\n \"num_unique_values\": 4,\n \"samples\": [\n 423.040009,\n 419.100006,\n 422.869995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 38
}
],
"source": [
"sp = pd.read_csv(\"https://raw.githubusercontent.com/justinjiajia/datafiles/master/adj_closing_sub.csv\")\n",
"sp"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BPXlruYGdGZ2"
},
"source": [
"The corresponding writer functions are object methods that are accessed like [`DataFrame.to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html):\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Hmt1EnUJdGZ5"
},
"outputs": [],
"source": [
"sp.to_csv(\"stockprice_new.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v38QUl8udGZ8"
},
"source": [
"\n",
"---\n",
"\n",
"
\n",
"\n",
"# 4 Computing Summary and Descriptive Statistics\n",
"\n",
"\n",
"`DataFrame` objects are equipped with common mathematical and statistical [methods](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats) for column-wise computations (or row-wise by setting `axis=1`):\n",
"\n",
"- Most of them produce aggregates:\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Ke5EX-BYdGZ9",
"outputId": "86254c4e-b11b-4da9-cfe0-63ee26dee4c6"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GOOG 533.425003\n",
"APPL 118.509924\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 39
}
],
"source": [
"sp[['GOOG', 'APPL']].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "nBb16pzt8mX3",
"outputId": "a37118d5-d770-4007-8801-b4a4b0acb28f"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Date 4\n",
"APPL 4\n",
"dtype: int64"
]
},
"metadata": {},
"execution_count": 40
}
],
"source": [
"sp[['Date', 'APPL']].nunique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pgedhvgAZjhs",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "c383a0fe-8a75-4381-d2ad-61462e03a8cb"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['AMZN',\n",
" 'APPL',\n",
" 'Date',\n",
" 'GOOG',\n",
" 'T',\n",
" '_AXIS_LEN',\n",
" '_AXIS_ORDERS',\n",
" '_AXIS_TO_AXIS_NUMBER',\n",
" '_HANDLED_TYPES',\n",
" '__abs__',\n",
" '__add__',\n",
" '__and__',\n",
" '__annotations__',\n",
" '__array__',\n",
" '__array_priority__',\n",
" '__array_ufunc__',\n",
" '__bool__',\n",
" '__class__',\n",
" '__contains__',\n",
" '__copy__',\n",
" '__dataframe__',\n",
" '__deepcopy__',\n",
" '__delattr__',\n",
" '__delitem__',\n",
" '__dict__',\n",
" '__dir__',\n",
" '__divmod__',\n",
" '__doc__',\n",
" '__eq__',\n",
" '__finalize__',\n",
" '__floordiv__',\n",
" '__format__',\n",
" '__ge__',\n",
" '__getattr__',\n",
" '__getattribute__',\n",
" '__getitem__',\n",
" '__getstate__',\n",
" '__gt__',\n",
" '__hash__',\n",
" '__iadd__',\n",
" '__iand__',\n",
" '__ifloordiv__',\n",
" '__imod__',\n",
" '__imul__',\n",
" '__init__',\n",
" '__init_subclass__',\n",
" '__invert__',\n",
" '__ior__',\n",
" '__ipow__',\n",
" '__isub__',\n",
" '__iter__',\n",
" '__itruediv__',\n",
" '__ixor__',\n",
" '__le__',\n",
" '__len__',\n",
" '__lt__',\n",
" '__matmul__',\n",
" '__mod__',\n",
" '__module__',\n",
" '__mul__',\n",
" '__ne__',\n",
" '__neg__',\n",
" '__new__',\n",
" '__nonzero__',\n",
" '__or__',\n",
" '__pos__',\n",
" '__pow__',\n",
" '__radd__',\n",
" '__rand__',\n",
" '__rdivmod__',\n",
" '__reduce__',\n",
" '__reduce_ex__',\n",
" '__repr__',\n",
" '__rfloordiv__',\n",
" '__rmatmul__',\n",
" '__rmod__',\n",
" '__rmul__',\n",
" '__ror__',\n",
" '__round__',\n",
" '__rpow__',\n",
" '__rsub__',\n",
" '__rtruediv__',\n",
" '__rxor__',\n",
" '__setattr__',\n",
" '__setitem__',\n",
" '__setstate__',\n",
" '__sizeof__',\n",
" '__str__',\n",
" '__sub__',\n",
" '__subclasshook__',\n",
" '__truediv__',\n",
" '__weakref__',\n",
" '__xor__',\n",
" '_accessors',\n",
" '_accum_func',\n",
" '_add_numeric_operations',\n",
" '_agg_examples_doc',\n",
" '_agg_summary_and_see_also_doc',\n",
" '_align_frame',\n",
" '_align_series',\n",
" '_append',\n",
" '_arith_method',\n",
" '_as_manager',\n",
" '_attrs',\n",
" '_box_col_values',\n",
" '_can_fast_transpose',\n",
" '_check_inplace_and_allows_duplicate_labels',\n",
" '_check_inplace_setting',\n",
" '_check_is_chained_assignment_possible',\n",
" '_check_label_or_level_ambiguity',\n",
" '_check_setitem_copy',\n",
" '_clear_item_cache',\n",
" '_clip_with_one_bound',\n",
" '_clip_with_scalar',\n",
" '_cmp_method',\n",
" '_combine_frame',\n",
" '_consolidate',\n",
" '_consolidate_inplace',\n",
" '_construct_axes_dict',\n",
" '_construct_result',\n",
" '_constructor',\n",
" '_constructor_sliced',\n",
" '_create_data_for_split_and_tight_to_dict',\n",
" '_data',\n",
" '_dir_additions',\n",
" '_dir_deletions',\n",
" '_dispatch_frame_op',\n",
" '_drop_axis',\n",
" '_drop_labels_or_levels',\n",
" '_ensure_valid_index',\n",
" '_find_valid_index',\n",
" '_flags',\n",
" '_from_arrays',\n",
" '_get_agg_axis',\n",
" '_get_axis',\n",
" '_get_axis_name',\n",
" '_get_axis_number',\n",
" '_get_axis_resolvers',\n",
" '_get_block_manager_axis',\n",
" '_get_bool_data',\n",
" '_get_cleaned_column_resolvers',\n",
" '_get_column_array',\n",
" '_get_index_resolvers',\n",
" '_get_item_cache',\n",
" '_get_label_or_level_values',\n",
" '_get_numeric_data',\n",
" '_get_value',\n",
" '_getitem_bool_array',\n",
" '_getitem_multilevel',\n",
" '_getitem_nocopy',\n",
" '_gotitem',\n",
" '_hidden_attrs',\n",
" '_indexed_same',\n",
" '_info_axis',\n",
" '_info_axis_name',\n",
" '_info_axis_number',\n",
" '_info_repr',\n",
" '_init_mgr',\n",
" '_inplace_method',\n",
" '_internal_names',\n",
" '_internal_names_set',\n",
" '_is_copy',\n",
" '_is_homogeneous_type',\n",
" '_is_label_or_level_reference',\n",
" '_is_label_reference',\n",
" '_is_level_reference',\n",
" '_is_mixed_type',\n",
" '_is_view',\n",
" '_iset_item',\n",
" '_iset_item_mgr',\n",
" '_iset_not_inplace',\n",
" '_item_cache',\n",
" '_iter_column_arrays',\n",
" '_ixs',\n",
" '_join_compat',\n",
" '_logical_func',\n",
" '_logical_method',\n",
" '_maybe_cache_changed',\n",
" '_maybe_update_cacher',\n",
" '_metadata',\n",
" '_mgr',\n",
" '_min_count_stat_function',\n",
" '_needs_reindex_multi',\n",
" '_protect_consolidate',\n",
" '_reduce',\n",
" '_reduce_axis1',\n",
" '_reindex_axes',\n",
" '_reindex_columns',\n",
" '_reindex_index',\n",
" '_reindex_multi',\n",
" '_reindex_with_indexers',\n",
" '_rename',\n",
" '_replace_columnwise',\n",
" '_repr_data_resource_',\n",
" '_repr_fits_horizontal_',\n",
" '_repr_fits_vertical_',\n",
" '_repr_html_',\n",
" '_repr_latex_',\n",
" '_reset_cache',\n",
" '_reset_cacher',\n",
" '_sanitize_column',\n",
" '_series',\n",
" '_set_axis',\n",
" '_set_axis_name',\n",
" '_set_axis_nocheck',\n",
" '_set_is_copy',\n",
" '_set_item',\n",
" '_set_item_frame_value',\n",
" '_set_item_mgr',\n",
" '_set_value',\n",
" '_setitem_array',\n",
" '_setitem_frame',\n",
" '_setitem_slice',\n",
" '_slice',\n",
" '_stat_axis',\n",
" '_stat_axis_name',\n",
" '_stat_axis_number',\n",
" '_stat_function',\n",
" '_stat_function_ddof',\n",
" '_take',\n",
" '_take_with_is_copy',\n",
" '_to_dict_of_blocks',\n",
" '_to_latex_via_styler',\n",
" '_typ',\n",
" '_update_inplace',\n",
" '_validate_dtype',\n",
" '_values',\n",
" '_where',\n",
" 'abs',\n",
" 'add',\n",
" 'add_prefix',\n",
" 'add_suffix',\n",
" 'agg',\n",
" 'aggregate',\n",
" 'align',\n",
" 'all',\n",
" 'any',\n",
" 'apply',\n",
" 'applymap',\n",
" 'asfreq',\n",
" 'asof',\n",
" 'assign',\n",
" 'astype',\n",
" 'at',\n",
" 'at_time',\n",
" 'attrs',\n",
" 'axes',\n",
" 'backfill',\n",
" 'between_time',\n",
" 'bfill',\n",
" 'bool',\n",
" 'boxplot',\n",
" 'clip',\n",
" 'columns',\n",
" 'combine',\n",
" 'combine_first',\n",
" 'compare',\n",
" 'convert_dtypes',\n",
" 'copy',\n",
" 'corr',\n",
" 'corrwith',\n",
" 'count',\n",
" 'cov',\n",
" 'cummax',\n",
" 'cummin',\n",
" 'cumprod',\n",
" 'cumsum',\n",
" 'describe',\n",
" 'diff',\n",
" 'div',\n",
" 'divide',\n",
" 'dot',\n",
" 'drop',\n",
" 'drop_duplicates',\n",
" 'droplevel',\n",
" 'dropna',\n",
" 'dtypes',\n",
" 'duplicated',\n",
" 'empty',\n",
" 'eq',\n",
" 'equals',\n",
" 'eval',\n",
" 'ewm',\n",
" 'expanding',\n",
" 'explode',\n",
" 'ffill',\n",
" 'fillna',\n",
" 'filter',\n",
" 'first',\n",
" 'first_valid_index',\n",
" 'flags',\n",
" 'floordiv',\n",
" 'from_dict',\n",
" 'from_records',\n",
" 'ge',\n",
" 'get',\n",
" 'groupby',\n",
" 'gt',\n",
" 'head',\n",
" 'hist',\n",
" 'iat',\n",
" 'idxmax',\n",
" 'idxmin',\n",
" 'iloc',\n",
" 'index',\n",
" 'infer_objects',\n",
" 'info',\n",
" 'insert',\n",
" 'interpolate',\n",
" 'isetitem',\n",
" 'isin',\n",
" 'isna',\n",
" 'isnull',\n",
" 'items',\n",
" 'iterrows',\n",
" 'itertuples',\n",
" 'join',\n",
" 'keys',\n",
" 'kurt',\n",
" 'kurtosis',\n",
" 'last',\n",
" 'last_valid_index',\n",
" 'le',\n",
" 'loc',\n",
" 'lt',\n",
" 'mask',\n",
" 'max',\n",
" 'mean',\n",
" 'median',\n",
" 'melt',\n",
" 'memory_usage',\n",
" 'merge',\n",
" 'min',\n",
" 'mod',\n",
" 'mode',\n",
" 'mul',\n",
" 'multiply',\n",
" 'ndim',\n",
" 'ne',\n",
" 'nlargest',\n",
" 'notna',\n",
" 'notnull',\n",
" 'nsmallest',\n",
" 'nunique',\n",
" 'pad',\n",
" 'pct_change',\n",
" 'pipe',\n",
" 'pivot',\n",
" 'pivot_table',\n",
" 'plot',\n",
" 'pop',\n",
" 'pow',\n",
" 'prod',\n",
" 'product',\n",
" 'quantile',\n",
" 'query',\n",
" 'radd',\n",
" 'rank',\n",
" 'rdiv',\n",
" 'reindex',\n",
" 'reindex_like',\n",
" 'rename',\n",
" 'rename_axis',\n",
" 'reorder_levels',\n",
" 'replace',\n",
" 'resample',\n",
" 'reset_index',\n",
" 'rfloordiv',\n",
" 'rmod',\n",
" 'rmul',\n",
" 'rolling',\n",
" 'round',\n",
" 'rpow',\n",
" 'rsub',\n",
" 'rtruediv',\n",
" 'sample',\n",
" 'select_dtypes',\n",
" 'sem',\n",
" 'set_axis',\n",
" 'set_flags',\n",
" 'set_geometry',\n",
" 'set_index',\n",
" 'shape',\n",
" 'shift',\n",
" 'size',\n",
" 'skew',\n",
" 'sort_index',\n",
" 'sort_values',\n",
" 'squeeze',\n",
" 'stack',\n",
" 'std',\n",
" 'style',\n",
" 'sub',\n",
" 'subtract',\n",
" 'sum',\n",
" 'swapaxes',\n",
" 'swaplevel',\n",
" 'tail',\n",
" 'take',\n",
" 'to_clipboard',\n",
" 'to_csv',\n",
" 'to_dict',\n",
" 'to_excel',\n",
" 'to_feather',\n",
" 'to_gbq',\n",
" 'to_hdf',\n",
" 'to_html',\n",
" 'to_json',\n",
" 'to_latex',\n",
" 'to_markdown',\n",
" 'to_numpy',\n",
" 'to_orc',\n",
" 'to_parquet',\n",
" 'to_period',\n",
" 'to_pickle',\n",
" 'to_records',\n",
" 'to_sql',\n",
" 'to_stata',\n",
" 'to_string',\n",
" 'to_timestamp',\n",
" 'to_xarray',\n",
" 'to_xml',\n",
" 'transform',\n",
" 'transpose',\n",
" 'truediv',\n",
" 'truncate',\n",
" 'tz_convert',\n",
" 'tz_localize',\n",
" 'unstack',\n",
" 'update',\n",
" 'value_counts',\n",
" 'values',\n",
" 'var',\n",
" 'where',\n",
" 'xs']"
]
},
"metadata": {},
"execution_count": 41
}
],
"source": [
"dir(sp)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NUufnYU-8mX4"
},
"source": [
"The following table summarizes some built-in Pandas aggregations:\n",
"\n",
"| Aggregation | Description |\n",
"|--------------------------|---------------------------------|\n",
"| ``count()`` | Total number of items |\n",
"| ``nunique()`` | Number of distinct items\n",
"| ``mean()``, ``median()`` | Mean and median |\n",
"| ``min()``, ``max()`` | Minimum and maximum |\n",
"| ``std()``, ``var()`` | Standard deviation and variance |\n",
"| ``mad()`` | Mean absolute deviation |\n",
"| ``prod()`` | Product of all items |\n",
"| ``sum()`` | Sum of all items |\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZpV-jgcddGZ_"
},
"source": [
"- Some statistics are computed from pairs of columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "VBaSjSmDdGaC",
"outputId": "d9b5f4e0-51ac-459b-fea8-bc8812a26bd2"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" GOOG APPL AMZN\n",
"GOOG 55.248513 13.269122 13.454445\n",
"APPL 13.269122 3.488251 3.236487\n",
"AMZN 13.454445 3.236487 3.364861"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" GOOG | \n",
" APPL | \n",
" AMZN | \n",
"
\n",
" \n",
" \n",
" \n",
" GOOG | \n",
" 55.248513 | \n",
" 13.269122 | \n",
" 13.454445 | \n",
"
\n",
" \n",
" APPL | \n",
" 13.269122 | \n",
" 3.488251 | \n",
" 3.236487 | \n",
"
\n",
" \n",
" AMZN | \n",
" 13.454445 | \n",
" 3.236487 | \n",
" 3.364861 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"sp\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"GOOG\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 24.18349209326099,\n \"min\": 13.269121918691017,\n \"max\": 55.24851276078893,\n \"num_unique_values\": 3,\n \"samples\": [\n 55.24851276078893,\n 13.269121918691017,\n 13.454444533302368\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.721051535790905,\n \"min\": 3.236486896205,\n \"max\": 13.269121918691017,\n \"num_unique_values\": 3,\n \"samples\": [\n 13.269121918691017,\n 3.48825113303532,\n 3.236486896205\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5.862633587955939,\n \"min\": 3.236486896205,\n \"max\": 13.454444533302368,\n \"num_unique_values\": 3,\n \"samples\": [\n 13.454444533302368,\n 3.236486896205,\n 3.364860926703335\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 42
}
],
"source": [
"sp.cov(numeric_only=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "K2lEOOAW8mX5",
"outputId": "b0e15c51-1915-4cb9-e2da-e1c65d661f9f"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" APPL AMZN\n",
"APPL 3.488251 3.236487\n",
"AMZN 3.236487 3.364861"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" APPL | \n",
" AMZN | \n",
"
\n",
" \n",
" \n",
" \n",
" APPL | \n",
" 3.488251 | \n",
" 3.236487 | \n",
"
\n",
" \n",
" AMZN | \n",
" 3.236487 | \n",
" 3.364861 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"sp[['APPL', 'AMZN']]\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.17802419912297515,\n \"min\": 3.236486896205,\n \"max\": 3.48825113303532,\n \"num_unique_values\": 2,\n \"samples\": [\n 3.236486896205,\n 3.48825113303532\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.09077414749362132,\n \"min\": 3.236486896205,\n \"max\": 3.364860926703335,\n \"num_unique_values\": 2,\n \"samples\": [\n 3.364860926703335,\n 3.236486896205\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 43
}
],
"source": [
"sp[['APPL', 'AMZN']].cov()"
]
},
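{
"cell_type": "markdown",
"metadata": {},
"source": [
"Correlation is another statistic computed from pairs of columns. The sketch below uses a made-up frame (`prices` and its columns `a`, `b`, `c` are hypothetical) to show `DataFrame.corr()` alongside the pairwise `Series.cov()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data: b is exactly 2 * a, and c decreases as a increases\n",
"prices = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],\n",
"                       'b': [2.0, 4.0, 6.0, 8.0],\n",
"                       'c': [4.0, 3.0, 2.0, 1.0]})\n",
"\n",
"corr_matrix = prices.corr()            # pairwise Pearson correlations\n",
"cov_ab = prices['a'].cov(prices['b'])  # covariance of a single pair of columns\n",
"\n",
"print(corr_matrix)\n",
"print(cov_ab)"
]
},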
{
"cell_type": "markdown",
"metadata": {
"id": "xo3NLs1QdGaD"
},
"source": [
"- Some produce multiple summary statistics in one shot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 300
},
"id": "bR30nxwSdGaE",
"outputId": "5dce12b4-bd69-4e0d-c724-e41a4dc12ad0"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" GOOG APPL AMZN\n",
"count 4.000000 4.000000 4.000000\n",
"mean 533.425003 118.509924 421.550003\n",
"std 7.432934 1.867686 1.834356\n",
"min 524.219971 116.547424 419.100006\n",
"25% 529.154984 117.099819 420.667503\n",
"50% 534.350006 118.635792 422.029999\n",
"75% 538.620025 120.045897 422.912499\n",
"max 540.780029 120.220688 423.040009"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" GOOG | \n",
" APPL | \n",
" AMZN | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 4.000000 | \n",
" 4.000000 | \n",
" 4.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 533.425003 | \n",
" 118.509924 | \n",
" 421.550003 | \n",
"
\n",
" \n",
" std | \n",
" 7.432934 | \n",
" 1.867686 | \n",
" 1.834356 | \n",
"
\n",
" \n",
" min | \n",
" 524.219971 | \n",
" 116.547424 | \n",
" 419.100006 | \n",
"
\n",
" \n",
" 25% | \n",
" 529.154984 | \n",
" 117.099819 | \n",
" 420.667503 | \n",
"
\n",
" \n",
" 50% | \n",
" 534.350006 | \n",
" 118.635792 | \n",
" 422.029999 | \n",
"
\n",
" \n",
" 75% | \n",
" 538.620025 | \n",
" 120.045897 | \n",
" 422.912499 | \n",
"
\n",
" \n",
" max | \n",
" 540.780029 | \n",
" 120.220688 | \n",
" 423.040009 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"sp\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"GOOG\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 244.33736825518412,\n \"min\": 4.0,\n \"max\": 540.780029,\n \"num_unique_values\": 8,\n \"samples\": [\n 533.4250030000001,\n 534.350006,\n 4.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"APPL\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 53.51923645373125,\n \"min\": 1.8676860370617223,\n \"max\": 120.220688,\n \"num_unique_values\": 8,\n \"samples\": [\n 118.50992400000001,\n 118.63579200000001,\n 4.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AMZN\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 193.79429323398534,\n \"min\": 1.834355725235249,\n \"max\": 423.040009,\n \"num_unique_values\": 8,\n \"samples\": [\n 421.55000300000006,\n 422.02999850000003,\n 4.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 89
}
],
"source": [
"# by default, summarize numeric columns only\n",
"sp.describe()"
]
},
{
"cell_type": "markdown",
"source": [
"The `include=['O']` means including only the columns with data type 'object' in the output of the describe() method. Here, 'O' stands for object, which typically pertains to strings or mixed data types in pandas."
],
"metadata": {
"id": "SjBU1-g8HJuZ"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "YUv2dmLK8mX8",
"outputId": "e322454b-bf14-4122-c9ee-006f331043b6"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Date\n",
"count 4\n",
"unique 4\n",
"top 2015/5/1\n",
"freq 1"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 4 | \n",
"
\n",
" \n",
" unique | \n",
" 4 | \n",
"
\n",
" \n",
" top | \n",
" 2015/5/1 | \n",
"
\n",
" \n",
" freq | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"sp\",\n \"rows\": 4,\n \"fields\": [\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"date\",\n \"min\": \"1970-01-01 00:00:00.000000001\",\n \"max\": \"2015-05-01 00:00:00\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"4\",\n \"2015/5/1\",\n \"1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 90
}
],
"source": [
"# Python object columns can be selected using include=['O'].\n",
"sp.describe(include=['O'])"
]
},
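{
"cell_type": "markdown",
"metadata": {},
"source": [
"To summarize numeric and non-numeric columns in a single call, `describe()` also accepts `include='all'`. A minimal sketch on made-up data (the frame `mixed` is a hypothetical example); statistics that do not apply to a column show up as `NaN`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical frame with one text column and one numeric column\n",
"mixed = pd.DataFrame({'city': ['HK', 'SZ', 'HK'], 'temp': [28.0, 30.0, 29.0]})\n",
"\n",
"full_summary = mixed.describe(include='all')\n",
"print(full_summary)"
]
},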
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ztubCDQ-8mX9",
"outputId": "9a50f269-caad-49fb-8a63-7bac2c4fb7a8"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"RangeIndex: 4 entries, 0 to 3\n",
"Data columns (total 4 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Date 4 non-null object \n",
" 1 GOOG 4 non-null float64\n",
" 2 APPL 4 non-null float64\n",
" 3 AMZN 4 non-null float64\n",
"dtypes: float64(3), object(1)\n",
"memory usage: 256.0+ bytes\n"
]
}
],
"source": [
"sp.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mtZzbfB8gKQu"
},
"source": [
"`Series` objects' `value_counts()` method can return the frequency of distinct values it contains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-8AcPogegr8y",
"outputId": "5f756067-11c9-4b06-9d15-cdfaf2afcde6"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Date\n",
"2015/5/1 1\n",
"2015/5/4 1\n",
"2015/5/5 1\n",
"2015/5/6 1\n",
"Name: count, dtype: int64"
]
},
"metadata": {},
"execution_count": 93
}
],
"source": [
"sp['Date'].value_counts()"
]
},
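{
"cell_type": "markdown",
"metadata": {},
"source": [
"`value_counts()` is more informative when values repeat. The sketch below, on a made-up `Series` named `colors` (hypothetical), also shows `normalize=True`, which returns relative frequencies instead of raw counts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data with repeated values\n",
"colors = pd.Series(['red', 'blue', 'red', 'green', 'red'])\n",
"\n",
"counts = colors.value_counts()                # sorted by frequency, descending\n",
"shares = colors.value_counts(normalize=True)  # fractions that sum to 1\n",
"\n",
"print(counts)\n",
"print(shares)"
]
},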
{
"cell_type": "markdown",
"metadata": {
"id": "_SRhspgjrzdg"
},
"source": [
"*Excercise*:\n",
"Can you calculate the standard deviation of stocks in `sp`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aGddi9_HsACN",
"outputId": "29eb1e7c-7ff4-46a0-f83e-cdb2d16fbe9a"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"GOOG 7.432934\n",
"APPL 1.867686\n",
"AMZN 1.834356\n",
"dtype: float64"
]
},
"metadata": {},
"execution_count": 94
}
],
"source": [
"# write your code here\n",
"sp[['GOOG', 'APPL', 'AMZN']].std()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LJMVU4nU8mW5"
},
"source": [
"---\n",
"\n",
"
\n",
"\n",
"# 5 Handling Missing Values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "PMCPChKl8mW5",
"outputId": "84d81f35-4918-4b20-9a23-99cdb3920089"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 1.0 5.0 4\n",
"1 2.0 NaN 5\n",
"2 NaN NaN 6"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" NaN | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df_w_nan",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 95
}
],
"source": [
"import numpy as np\n",
"\n",
"df_w_nan = pd.DataFrame({'A': [1, 2, np.nan],\n",
" 'B': [5, np.nan, np.nan],\n",
" 'C': [4, 5, 6]})\n",
"df_w_nan"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WxrkQ2wKdGZH"
},
"source": [
"\n",
"Pandas provides several useful methods for detecting, removing, and replacing missing values in pandas data structures:\n",
"\n",
"- [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) generates a boolean mask indicating missing values, while [`notnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html) produces the opposite:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "kYRcEYES8mW9",
"outputId": "0ce23f0a-1a21-4faf-d5fe-8d3cd5cca75d"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 False False False\n",
"1 False True False\n",
"2 True True False"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" True | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 96
}
],
"source": [
"df_w_nan.isnull()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xkXQSA9K8mW_",
"outputId": "95289e32-344c-4f24-be16-e600b50e926f"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 True\n",
"1 True\n",
"2 False\n",
"Name: A, dtype: bool"
]
},
"metadata": {},
"execution_count": 97
}
],
"source": [
"df_w_nan.A.notnull()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JHC_6HHQ8mXB"
},
"source": [
"\n",
"- [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) returns a filtered version of the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 81
},
"id": "FpDN3MRx8mXC",
"outputId": "42add103-652f-4b08-e213-4c398dc2af3a"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 1.0 5.0 4"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 1,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 1.0,\n \"max\": 1.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 4,\n \"max\": 4,\n \"num_unique_values\": 1,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 98
}
],
"source": [
"df_w_nan.dropna(axis=0) # axis=0 Drop rows which contain missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "H6_ulSI-8mXD",
"outputId": "467f88cf-36bf-484a-d8f3-74b32990793b"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" C\n",
"0 4\n",
"1 5\n",
"2 6"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4,\n 5,\n 6\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 99
}
],
"source": [
"df_w_nan.dropna(axis=1) # axis=1 Drop columns which contain missing values."
]
},
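{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, `dropna()` drops a row or column if it contains *any* missing value. Its `how`, `thresh`, and `subset` parameters relax that rule. A minimal sketch on a fresh frame with the same values as `df_w_nan` (the name `frame` is a hypothetical stand-in, so `df_w_nan` is left untouched):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in with the same values as df_w_nan\n",
"frame = pd.DataFrame({'A': [1, 2, np.nan],\n",
"                      'B': [5, np.nan, np.nan],\n",
"                      'C': [4, 5, 6]})\n",
"\n",
"all_missing = frame.dropna(how='all')   # drop rows only if every value is missing\n",
"at_least_two = frame.dropna(thresh=2)   # keep rows with at least 2 non-missing values\n",
"a_present = frame.dropna(subset=['A'])  # consider only column A when deciding\n",
"\n",
"print(len(all_missing), len(at_least_two), len(a_present))"
]
},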
{
"cell_type": "markdown",
"metadata": {
"id": "yivKqJG-8mXG"
},
"source": [
"\n",
"- [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) returns a copy of the data with missing values filled or imputed (set `inplace=True` to modify it in place):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "lmVV7bpl8mXI",
"outputId": "efbf5492-41f9-43c1-d24f-d4088ca1226b"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 1.0 5.0 4\n",
"1 2.0 NaN 5\n",
"2 NaN NaN 6"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" NaN | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df_w_nan",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": null,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 100
}
],
"source": [
"df_w_nan"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "YHN9QNRX8mXK",
"outputId": "1f7e1037-257f-44c4-f982-3f6d296a93f9"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 1.0 5.0 4\n",
"1 2.0 0.0 5\n",
"2 0.0 0.0 6"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0,\n \"min\": 0.0,\n \"max\": 2.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 1.0,\n 2.0,\n 0.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2.8867513459481287,\n \"min\": 0.0,\n \"max\": 5.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 0.0,\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4,\n 5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 102
}
],
"source": [
"# Replace all NaN elements with 0s.\n",
"df_w_nan.fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "CPDbf5qv8mXL",
"outputId": "26f6ff4d-40c5-4f3b-ae0e-04f03eeabda2"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 1.0 5.0 4\n",
"1 2.0 5.0 5\n",
"2 2.0 5.0 6"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" 5.0 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" 2.0 | \n",
" 5.0 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5773502691896257,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 103
}
],
"source": [
"#ffill() function is used to forward fill the missing value with the value from the previous row (column) when axis = 0 (1)\n",
"df_w_nan.fillna(method='ffill', axis=0)"
]
},
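{
"cell_type": "markdown",
"metadata": {},
"source": [
"`fillna()` also accepts per-column fill values, given as a dictionary or a `Series` indexed by column name. As a sketch (on a hypothetical frame named `frame` with the same values as `df_w_nan`, so the original is not modified), the cell below imputes each column with its own mean:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in with the same values as df_w_nan\n",
"frame = pd.DataFrame({'A': [1, 2, np.nan],\n",
"                      'B': [5, np.nan, np.nan],\n",
"                      'C': [4, 5, 6]})\n",
"\n",
"by_constant = frame.fillna({'A': 0, 'B': -1})  # a different constant per column\n",
"by_mean = frame.fillna(frame.mean())           # each column filled with its own mean\n",
"\n",
"print(by_mean)"
]
},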
{
"cell_type": "markdown",
"metadata": {
"id": "lmZwc_wpt_pj"
},
"source": [
"A more flexible way to fill or impute values (in place) is to use the assignment form of indexing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "y4WYqzg-mIjX",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"outputId": "825c1e82-6723-4e97-b1ef-dbdec91719cd"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C\n",
"0 1.0 5.0 4\n",
"1 2.0 5.0 5\n",
"2 NaN 5.0 6"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.0 | \n",
" 4 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.0 | \n",
" 5.0 | \n",
" 5 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" 5.0 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df_w_nan",
"summary": "{\n \"name\": \"df_w_nan\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"A\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.7071067811865476,\n \"min\": 1.0,\n \"max\": 2.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 2.0,\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"B\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0,\n \"min\": 5.0,\n \"max\": 5.0,\n \"num_unique_values\": 1,\n \"samples\": [\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"C\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 4,\n \"max\": 6,\n \"num_unique_values\": 3,\n \"samples\": [\n 4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 104
}
],
"source": [
"df_w_nan.loc[df_w_nan.B.isnull(), 'B'] = df_w_nan.B.mean()\n",
"df_w_nan"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "62sWIkFldGap"
},
"source": [
"---\n",
"\n",
"
\n",
"\n",
"# 6 Computing Group-wise Summary Statistics\n",
"\n",
"\n",
"\n",
"Categorizing a dataset and applying a function to each group (whether be an aggregation or transformation) is often a critical component of a data analysis workflow. \n",
"\n",
"
\n",
"\n",
"\n",
"\n",
"Splitting data in a `DataFrame` into groups can be done by calling the `DataFrame`'s [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method, passing the name of the desired key column:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 425
},
"id": "ji2znJdbdGa2",
"outputId": "b0adf8e6-c883-4797-f2f4-4c264ba84695"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Date Symbol Price Volume\n",
"0 2015/5/1 GOOG 537.900024 1768200\n",
"1 2015/5/4 GOOG 540.780029 1308000\n",
"2 2015/5/5 GOOG 530.799988 1383100\n",
"3 2015/5/6 GOOG 524.219971 1567000\n",
"4 2015/5/1 APPL 120.220688 58512600\n",
"5 2015/5/4 APPL 119.987633 50988300\n",
"6 2015/5/5 APPL 117.283951 49271400\n",
"7 2015/5/6 APPL 116.547424 72141000\n",
"8 2015/5/1 AMZN 422.869995 3565800\n",
"9 2015/5/4 AMZN 423.040009 2270400\n",
"10 2015/5/5 AMZN 421.190002 2856400\n",
"11 2015/5/6 AMZN 419.100006 2552500"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Symbol | \n",
" Price | \n",
" Volume | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2015/5/1 | \n",
" GOOG | \n",
" 537.900024 | \n",
" 1768200 | \n",
"
\n",
" \n",
" 1 | \n",
" 2015/5/4 | \n",
" GOOG | \n",
" 540.780029 | \n",
" 1308000 | \n",
"
\n",
" \n",
" 2 | \n",
" 2015/5/5 | \n",
" GOOG | \n",
" 530.799988 | \n",
" 1383100 | \n",
"
\n",
" \n",
" 3 | \n",
" 2015/5/6 | \n",
" GOOG | \n",
" 524.219971 | \n",
" 1567000 | \n",
"
\n",
" \n",
" 4 | \n",
" 2015/5/1 | \n",
" APPL | \n",
" 120.220688 | \n",
" 58512600 | \n",
"
\n",
" \n",
" 5 | \n",
" 2015/5/4 | \n",
" APPL | \n",
" 119.987633 | \n",
" 50988300 | \n",
"
\n",
" \n",
" 6 | \n",
" 2015/5/5 | \n",
" APPL | \n",
" 117.283951 | \n",
" 49271400 | \n",
"
\n",
" \n",
" 7 | \n",
" 2015/5/6 | \n",
" APPL | \n",
" 116.547424 | \n",
" 72141000 | \n",
"
\n",
" \n",
" 8 | \n",
" 2015/5/1 | \n",
" AMZN | \n",
" 422.869995 | \n",
" 3565800 | \n",
"
\n",
" \n",
" 9 | \n",
" 2015/5/4 | \n",
" AMZN | \n",
" 423.040009 | \n",
" 2270400 | \n",
"
\n",
" \n",
" 10 | \n",
" 2015/5/5 | \n",
" AMZN | \n",
" 421.190002 | \n",
" 2856400 | \n",
"
\n",
" \n",
" 11 | \n",
" 2015/5/6 | \n",
" AMZN | \n",
" 419.100006 | \n",
" 2552500 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "stock",
"summary": "{\n \"name\": \"stock\",\n \"rows\": 12,\n \"fields\": [\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"2015/5/4\",\n \"2015/5/6\",\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"GOOG\",\n \"APPL\",\n \"AMZN\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 183.11895671451376,\n \"min\": 116.547424,\n \"max\": 540.780029,\n \"num_unique_values\": 12,\n \"samples\": [\n 421.190002,\n 423.040009,\n 537.900024\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27902924,\n \"min\": 1308000,\n \"max\": 72141000,\n \"num_unique_values\": 12,\n \"samples\": [\n 2856400,\n 2270400,\n 1768200\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 44
}
],
"source": [
"#! wget -q -O stock.csv \"https://raw.githubusercontent.com/justinjiajia/datafiles/main/pricevolume_sub.csv\"\n",
"#stock = pd.read_csv(\"stock.csv\")\n",
"stock = pd.read_csv(\"https://raw.githubusercontent.com/justinjiajia/datafiles/main/pricevolume_sub.csv\")\n",
"stock"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pu1SX3u8dGa4",
"outputId": "16880a51-126c-4d43-b8e2-dda3f0398a20"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"AMZN\n",
" Date Symbol Price Volume\n",
"8 2015/5/1 AMZN 422.869995 3565800\n",
"9 2015/5/4 AMZN 423.040009 2270400\n",
"10 2015/5/5 AMZN 421.190002 2856400\n",
"11 2015/5/6 AMZN 419.100006 2552500\n",
"APPL\n",
" Date Symbol Price Volume\n",
"4 2015/5/1 APPL 120.220688 58512600\n",
"5 2015/5/4 APPL 119.987633 50988300\n",
"6 2015/5/5 APPL 117.283951 49271400\n",
"7 2015/5/6 APPL 116.547424 72141000\n",
"GOOG\n",
" Date Symbol Price Volume\n",
"0 2015/5/1 GOOG 537.900024 1768200\n",
"1 2015/5/4 GOOG 540.780029 1308000\n",
"2 2015/5/5 GOOG 530.799988 1383100\n",
"3 2015/5/6 GOOG 524.219971 1567000\n"
]
}
],
"source": [
"stock_by_symbol = stock.groupby('Symbol')\n",
"\n",
"# what was returned is a GroupBy object\n",
"# wrap it in a loop to have a peek at the resulting grouping\n",
"for key, group in stock_by_symbol:\n",
" print(key)\n",
" print(group)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "S335A83LdGa6",
"outputId": "3174bd76-8a29-4625-cebf-45d207043727"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"('2015/5/1', 'AMZN')\n",
" Date Symbol Price Volume\n",
"8 2015/5/1 AMZN 422.869995 3565800\n",
"('2015/5/1', 'APPL')\n",
" Date Symbol Price Volume\n",
"4 2015/5/1 APPL 120.220688 58512600\n",
"('2015/5/1', 'GOOG')\n",
" Date Symbol Price Volume\n",
"0 2015/5/1 GOOG 537.900024 1768200\n",
"('2015/5/4', 'AMZN')\n",
" Date Symbol Price Volume\n",
"9 2015/5/4 AMZN 423.040009 2270400\n",
"('2015/5/4', 'APPL')\n",
" Date Symbol Price Volume\n",
"5 2015/5/4 APPL 119.987633 50988300\n",
"('2015/5/4', 'GOOG')\n",
" Date Symbol Price Volume\n",
"1 2015/5/4 GOOG 540.780029 1308000\n",
"('2015/5/5', 'AMZN')\n",
" Date Symbol Price Volume\n",
"10 2015/5/5 AMZN 421.190002 2856400\n",
"('2015/5/5', 'APPL')\n",
" Date Symbol Price Volume\n",
"6 2015/5/5 APPL 117.283951 49271400\n",
"('2015/5/5', 'GOOG')\n",
" Date Symbol Price Volume\n",
"2 2015/5/5 GOOG 530.799988 1383100\n",
"('2015/5/6', 'AMZN')\n",
" Date Symbol Price Volume\n",
"11 2015/5/6 AMZN 419.100006 2552500\n",
"('2015/5/6', 'APPL')\n",
" Date Symbol Price Volume\n",
"7 2015/5/6 APPL 116.547424 72141000\n",
"('2015/5/6', 'GOOG')\n",
" Date Symbol Price Volume\n",
"3 2015/5/6 GOOG 524.219971 1567000\n"
]
}
],
"source": [
"stock_by_date_symbol = stock.groupby(['Date', 'Symbol'])\n",
"\n",
"for key, group in stock_by_date_symbol:\n",
" print(key)\n",
" print(group)"
]
},
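{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides iterating, a `GroupBy` object can be inspected directly. The sketch below uses a small made-up frame (`trades` and its values are hypothetical, not the `stock` data) to show `ngroups`, `size()`, and `get_group()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data, made up for illustration only\n",
"trades = pd.DataFrame({'Symbol': ['X', 'Y', 'X', 'Y', 'X'],\n",
"                       'Price': [10.0, 20.0, 12.0, 22.0, 14.0]})\n",
"\n",
"by_symbol = trades.groupby('Symbol')\n",
"\n",
"n_groups = by_symbol.ngroups       # number of distinct keys\n",
"group_sizes = by_symbol.size()     # number of rows in each group\n",
"x_rows = by_symbol.get_group('X')  # the sub-DataFrame for a single key\n",
"\n",
"print(n_groups)\n",
"print(group_sizes)\n",
"print(x_rows)"
]
},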
{
"cell_type": "markdown",
"metadata": {
"id": "WDwA2lHpdGbB"
},
"source": [
"\n",
"\n",
"Pandas provides [many common aggregations](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) that can be applied to `GroupBy` objects and return a scalar per group in the apply/combine steps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "z_mfX9WLcPRu",
"outputId": "5460927b-8e65-41f9-e4a5-f0dd6957a7f3"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Date',\n",
" 'Price',\n",
" 'Symbol',\n",
" 'Volume',\n",
" '_DataFrameGroupBy__examples_dataframe_doc',\n",
" '__annotations__',\n",
" '__class__',\n",
" '__class_getitem__',\n",
" '__delattr__',\n",
" '__dict__',\n",
" '__dir__',\n",
" '__doc__',\n",
" '__eq__',\n",
" '__format__',\n",
" '__ge__',\n",
" '__getattr__',\n",
" '__getattribute__',\n",
" '__getitem__',\n",
" '__gt__',\n",
" '__hash__',\n",
" '__init__',\n",
" '__init_subclass__',\n",
" '__iter__',\n",
" '__le__',\n",
" '__len__',\n",
" '__lt__',\n",
" '__module__',\n",
" '__ne__',\n",
" '__new__',\n",
" '__orig_bases__',\n",
" '__parameters__',\n",
" '__reduce__',\n",
" '__reduce_ex__',\n",
" '__repr__',\n",
" '__setattr__',\n",
" '__sizeof__',\n",
" '__slots__',\n",
" '__str__',\n",
" '__subclasshook__',\n",
" '__weakref__',\n",
" '_accessors',\n",
" '_agg_examples_doc',\n",
" '_agg_general',\n",
" '_agg_py_fallback',\n",
" '_aggregate_frame',\n",
" '_aggregate_with_numba',\n",
" '_apply_filter',\n",
" '_apply_to_column_groupbys',\n",
" '_ascending_count',\n",
" '_bool_agg',\n",
" '_cache',\n",
" '_choose_path',\n",
" '_concat_objects',\n",
" '_constructor',\n",
" '_cumcount_array',\n",
" '_cython_agg_general',\n",
" '_cython_transform',\n",
" '_define_paths',\n",
" '_descending_count',\n",
" '_dir_additions',\n",
" '_dir_deletions',\n",
" '_fill',\n",
" '_get_cythonized_result',\n",
" '_get_data_to_aggregate',\n",
" '_get_index',\n",
" '_get_indices',\n",
" '_gotitem',\n",
" '_hidden_attrs',\n",
" '_indexed_output_to_ndframe',\n",
" '_insert_inaxis_grouper',\n",
" '_internal_names',\n",
" '_internal_names_set',\n",
" '_is_protocol',\n",
" '_iterate_column_groupbys',\n",
" '_iterate_slices',\n",
" '_make_mask_from_int',\n",
" '_make_mask_from_list',\n",
" '_make_mask_from_positional_indexer',\n",
" '_make_mask_from_slice',\n",
" '_make_mask_from_tuple',\n",
" '_mask_selected_obj',\n",
" '_maybe_transpose_result',\n",
" '_nth',\n",
" '_numba_agg_general',\n",
" '_numba_prep',\n",
" '_obj_1d_constructor',\n",
" '_obj_with_exclusions',\n",
" '_op_via_apply',\n",
" '_positional_selector',\n",
" '_python_agg_general',\n",
" '_python_apply_general',\n",
" '_reindex_output',\n",
" '_reset_cache',\n",
" '_selected_obj',\n",
" '_selection',\n",
" '_selection_list',\n",
" '_set_result_index_ordered',\n",
" '_transform',\n",
" '_transform_general',\n",
" '_transform_with_numba',\n",
" '_value_counts',\n",
" '_wrap_agged_manager',\n",
" '_wrap_aggregated_output',\n",
" '_wrap_applied_output',\n",
" '_wrap_applied_output_series',\n",
" '_wrap_transform_fast_result',\n",
" 'agg',\n",
" 'aggregate',\n",
" 'all',\n",
" 'any',\n",
" 'apply',\n",
" 'bfill',\n",
" 'boxplot',\n",
" 'corr',\n",
" 'corrwith',\n",
" 'count',\n",
" 'cov',\n",
" 'cumcount',\n",
" 'cummax',\n",
" 'cummin',\n",
" 'cumprod',\n",
" 'cumsum',\n",
" 'describe',\n",
" 'diff',\n",
" 'dtypes',\n",
" 'ewm',\n",
" 'expanding',\n",
" 'ffill',\n",
" 'fillna',\n",
" 'filter',\n",
" 'first',\n",
" 'get_group',\n",
" 'groups',\n",
" 'head',\n",
" 'hist',\n",
" 'idxmax',\n",
" 'idxmin',\n",
" 'indices',\n",
" 'last',\n",
" 'max',\n",
" 'mean',\n",
" 'median',\n",
" 'min',\n",
" 'ndim',\n",
" 'ngroup',\n",
" 'ngroups',\n",
" 'nth',\n",
" 'nunique',\n",
" 'ohlc',\n",
" 'pct_change',\n",
" 'pipe',\n",
" 'plot',\n",
" 'prod',\n",
" 'quantile',\n",
" 'rank',\n",
" 'resample',\n",
" 'rolling',\n",
" 'sample',\n",
" 'sem',\n",
" 'shift',\n",
" 'size',\n",
" 'skew',\n",
" 'std',\n",
" 'sum',\n",
" 'tail',\n",
" 'take',\n",
" 'transform',\n",
" 'value_counts',\n",
" 'var']"
]
},
"metadata": {},
"execution_count": 109
}
],
"source": [
"dir(stock_by_symbol)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "eCB8zLYgdGbD",
"outputId": "d963268a-3340-42c2-9ef2-a2f7a83444df"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Price Volume\n",
"Symbol \n",
"AMZN 421.550003 2811275.0\n",
"APPL 118.509924 57728325.0\n",
"GOOG 533.425003 1506575.0"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Price | \n",
" Volume | \n",
"
\n",
" \n",
" Symbol | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" AMZN | \n",
" 421.550003 | \n",
" 2811275.0 | \n",
"
\n",
" \n",
" APPL | \n",
" 118.509924 | \n",
" 57728325.0 | \n",
"
\n",
" \n",
" GOOG | \n",
" 533.425003 | \n",
" 1506575.0 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 214.671775775214,\n \"min\": 118.509924,\n \"max\": 533.4250030000001,\n \"num_unique_values\": 3,\n \"samples\": [\n 421.550003,\n 118.509924,\n 533.4250030000001\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 32089639.54262861,\n \"min\": 1506575.0,\n \"max\": 57728325.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 2811275.0,\n 57728325.0,\n 1506575.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 110
}
],
"source": [
"# can only apply to numeric columns\n",
"stock_by_symbol.mean(numeric_only=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "mfaA9Uyn0cI5",
"outputId": "ec274c92-0a2b-4c3d-ce4b-6f888a5221a9"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Date Price Volume\n",
"Symbol \n",
"AMZN 2015/5/1 419.100006 2270400\n",
"APPL 2015/5/1 116.547424 49271400\n",
"GOOG 2015/5/1 524.219971 1308000"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Price | \n",
" Volume | \n",
"
\n",
" \n",
" Symbol | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" AMZN | \n",
" 2015/5/1 | \n",
" 419.100006 | \n",
" 2270400 | \n",
"
\n",
" \n",
" APPL | \n",
" 2015/5/1 | \n",
" 116.547424 | \n",
" 49271400 | \n",
"
\n",
" \n",
" GOOG | \n",
" 2015/5/1 | \n",
" 524.219971 | \n",
" 1308000 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 211.65426899149122,\n \"min\": 116.547424,\n \"max\": 524.219971,\n \"num_unique_values\": 3,\n \"samples\": [\n 419.100006\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27418083,\n \"min\": 1308000,\n \"max\": 49271400,\n \"num_unique_values\": 3,\n \"samples\": [\n 2270400\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 47
}
],
"source": [
"# can also apply to categorical columns\n",
"stock_by_symbol.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 174
},
"id": "o3iyQqHJ0nD0",
"outputId": "8c11d8ae-faa1-4d42-f314-41b2eb8ff431"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Date Price Volume\n",
"Symbol \n",
"AMZN 2015/5/1 422.869995 3565800\n",
"APPL 2015/5/1 120.220688 58512600\n",
"GOOG 2015/5/1 537.900024 1768200"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Price | \n",
" Volume | \n",
"
\n",
" \n",
" Symbol | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" AMZN | \n",
" 2015/5/1 | \n",
" 422.869995 | \n",
" 3565800 | \n",
"
\n",
" \n",
" APPL | \n",
" 2015/5/1 | \n",
" 120.220688 | \n",
" 58512600 | \n",
"
\n",
" \n",
" GOOG | \n",
" 2015/5/1 | \n",
" 537.900024 | \n",
" 1768200 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Date\",\n \"properties\": {\n \"dtype\": \"object\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"2015/5/1\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 215.74851807939186,\n \"min\": 120.220688,\n \"max\": 537.900024,\n \"num_unique_values\": 3,\n \"samples\": [\n 422.869995\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 32254997,\n \"min\": 1768200,\n \"max\": 58512600,\n \"num_unique_values\": 3,\n \"samples\": [\n 3565800\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 48
}
],
"source": [
"# select the first record of each group\n",
"stock_by_symbol.first()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Nz6Nb0-GdGbH"
},
"source": [
"To suppress using group keys as indices in the aggregated output, we can pass `as_index=False` to `groupby()` when first creating the `GroupBy` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 143
},
"id": "x5YRkPy2dGbH",
"outputId": "260fd0f6-079f-4357-d92e-7177d5b5bc7a"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Symbol Price Volume\n",
"0 AMZN 421.550003 2811275.0\n",
"1 APPL 118.509924 57728325.0\n",
"2 GOOG 533.425003 1506575.0"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Symbol | \n",
" Price | \n",
" Volume | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" AMZN | \n",
" 421.550003 | \n",
" 2811275.0 | \n",
"
\n",
" \n",
" 1 | \n",
" APPL | \n",
" 118.509924 | \n",
" 57728325.0 | \n",
"
\n",
" \n",
" 2 | \n",
" GOOG | \n",
" 533.425003 | \n",
" 1506575.0 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"stock_by_symbol\",\n \"rows\": 3,\n \"fields\": [\n {\n \"column\": \"Symbol\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"AMZN\",\n \"APPL\",\n \"GOOG\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 214.671775775214,\n \"min\": 118.509924,\n \"max\": 533.4250030000001,\n \"num_unique_values\": 3,\n \"samples\": [\n 421.550003,\n 118.509924,\n 533.4250030000001\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Volume\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 32089639.54262861,\n \"min\": 1506575.0,\n \"max\": 57728325.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 2811275.0,\n 57728325.0,\n 1506575.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 49
}
],
"source": [
"stock_by_symbol = stock.groupby('Symbol', as_index=False)\n",
"stock_by_symbol.mean(numeric_only=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pcMUIpx1s4BW"
},
"source": [
"*In-Class Exercise:*\n",
"\n",
"What is the most popular names for US babies?\n",
"\n",
"Below codes allows you to retrieve US baby names from 2004~2014, please apply what you have learnt to get the 5 most popular names during this period with the counts of each name? Hint: you may use `.sort_values(by=\"Column\", ascending=False)` method to sort a DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 423
},
"id": "2JIpEZY7tOgA",
"outputId": "1cf8a561-c172-4335-df7e-c933ed674983"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Unnamed: 0 Id Name Year Gender State Count\n",
"0 11349 11350 Emma 2004 F AK 62\n",
"1 11350 11351 Madison 2004 F AK 48\n",
"2 11351 11352 Hannah 2004 F AK 46\n",
"3 11352 11353 Grace 2004 F AK 44\n",
"4 11353 11354 Emily 2004 F AK 41\n",
"... ... ... ... ... ... ... ...\n",
"1016390 5647421 5647422 Seth 2014 M WY 5\n",
"1016391 5647422 5647423 Spencer 2014 M WY 5\n",
"1016392 5647423 5647424 Tyce 2014 M WY 5\n",
"1016393 5647424 5647425 Victor 2014 M WY 5\n",
"1016394 5647425 5647426 Waylon 2014 M WY 5\n",
"\n",
"[1016395 rows x 7 columns]"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" Id | \n",
" Name | \n",
" Year | \n",
" Gender | \n",
" State | \n",
" Count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 11349 | \n",
" 11350 | \n",
" Emma | \n",
" 2004 | \n",
" F | \n",
" AK | \n",
" 62 | \n",
"
\n",
" \n",
" 1 | \n",
" 11350 | \n",
" 11351 | \n",
" Madison | \n",
" 2004 | \n",
" F | \n",
" AK | \n",
" 48 | \n",
"
\n",
" \n",
" 2 | \n",
" 11351 | \n",
" 11352 | \n",
" Hannah | \n",
" 2004 | \n",
" F | \n",
" AK | \n",
" 46 | \n",
"
\n",
" \n",
" 3 | \n",
" 11352 | \n",
" 11353 | \n",
" Grace | \n",
" 2004 | \n",
" F | \n",
" AK | \n",
" 44 | \n",
"
\n",
" \n",
" 4 | \n",
" 11353 | \n",
" 11354 | \n",
" Emily | \n",
" 2004 | \n",
" F | \n",
" AK | \n",
" 41 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 1016390 | \n",
" 5647421 | \n",
" 5647422 | \n",
" Seth | \n",
" 2014 | \n",
" M | \n",
" WY | \n",
" 5 | \n",
"
\n",
" \n",
" 1016391 | \n",
" 5647422 | \n",
" 5647423 | \n",
" Spencer | \n",
" 2014 | \n",
" M | \n",
" WY | \n",
" 5 | \n",
"
\n",
" \n",
" 1016392 | \n",
" 5647423 | \n",
" 5647424 | \n",
" Tyce | \n",
" 2014 | \n",
" M | \n",
" WY | \n",
" 5 | \n",
"
\n",
" \n",
" 1016393 | \n",
" 5647424 | \n",
" 5647425 | \n",
" Victor | \n",
" 2014 | \n",
" M | \n",
" WY | \n",
" 5 | \n",
"
\n",
" \n",
" 1016394 | \n",
" 5647425 | \n",
" 5647426 | \n",
" Waylon | \n",
" 2014 | \n",
" M | \n",
" WY | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
1016395 rows × 7 columns
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "baby_names"
}
},
"metadata": {},
"execution_count": 3
}
],
"source": [
"baby_names = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv')\n",
"baby_names"
]
},
{
"cell_type": "code",
"source": [
"#write your code here\n",
"group_baby_names = baby_names.groupby('Name')"
],
"metadata": {
"id": "Xrp2SCEqLKlV"
},
"execution_count": 4,
"outputs": []
},
{
"cell_type": "code",
"source": [
"group_baby_names.sum(numeric_only=True).sort_values(by=\"Count\", ascending=False).iloc[:5,[3]]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 237
},
"id": "KaSgCPzQxn6A",
"outputId": "514866e6-4e89-444e-f2d0-1a3c022ddf88"
},
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Count\n",
"Name \n",
"Jacob 242874\n",
"Emma 214852\n",
"Michael 214405\n",
"Ethan 209277\n",
"Isabella 204798"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Count | \n",
"
\n",
" \n",
" Name | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Jacob | \n",
" 242874 | \n",
"
\n",
" \n",
" Emma | \n",
" 214852 | \n",
"
\n",
" \n",
" Michael | \n",
" 214405 | \n",
"
\n",
" \n",
" Ethan | \n",
" 209277 | \n",
"
\n",
" \n",
" Isabella | \n",
" 204798 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"group_baby_names\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Emma\",\n \"Isabella\",\n \"Michael\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 14908,\n \"min\": 204798,\n \"max\": 242874,\n \"num_unique_values\": 5,\n \"samples\": [\n 214852,\n 204798,\n 214405\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eQAMrXLnRNhp"
},
"source": [
"*Additional*: What are the 5 most popular male/female baby names?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "MIOHSZeOWKk4"
},
"outputs": [],
"source": [
"# write your code here\n",
"group_gender_baby_names = baby_names.groupby(['Gender','Name'])"
]
},
{
"cell_type": "code",
"source": [
"name_count_gender = group_gender_baby_names.sum(numeric_only=True)"
],
"metadata": {
"id": "w2fTmNoTyn_y"
},
"execution_count": 7,
"outputs": []
},
{
"cell_type": "code",
"source": [
"name_count_gender.loc['F'].sort_values(by=\"Count\", ascending=False).iloc[:5,[3]]"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 237
},
"id": "BH5bYXiEyxY1",
"outputId": "3b4606c0-e456-4abf-a517-7854a8650d8a"
},
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Count\n",
"Name \n",
"Emma 214757\n",
"Isabella 204742\n",
"Sophia 191421\n",
"Emily 190211\n",
"Olivia 187962"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Count | \n",
"
\n",
" \n",
" Name | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Emma | \n",
" 214757 | \n",
"
\n",
" \n",
" Isabella | \n",
" 204742 | \n",
"
\n",
" \n",
" Sophia | \n",
" 191421 | \n",
"
\n",
" \n",
" Emily | \n",
" 190211 | \n",
"
\n",
" \n",
" Olivia | \n",
" 187962 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"name_count_gender\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Isabella\",\n \"Olivia\",\n \"Sophia\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 11519,\n \"min\": 187962,\n \"max\": 214757,\n \"num_unique_values\": 5,\n \"samples\": [\n 204742,\n 187962,\n 191421\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 8
}
]
},
{
"cell_type": "code",
"source": [
"name_count_gender.loc['M'].sort_values(by=\"Count\", ascending=False).iloc[:5,[3]]"
],
"metadata": {
"id": "OuRPQAHhzToK",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 237
},
"outputId": "5408c993-afb9-4841-cf44-562b2bcfd146"
},
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Count\n",
"Name \n",
"Jacob 242706\n",
"Michael 214228\n",
"Ethan 209153\n",
"William 197796\n",
"Joshua 191444"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Count | \n",
"
\n",
" \n",
" Name | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Jacob | \n",
" 242706 | \n",
"
\n",
" \n",
" Michael | \n",
" 214228 | \n",
"
\n",
" \n",
" Ethan | \n",
" 209153 | \n",
"
\n",
" \n",
" William | \n",
" 197796 | \n",
"
\n",
" \n",
" Joshua | \n",
" 191444 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"name_count_gender\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"Name\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"Michael\",\n \"Joshua\",\n \"Ethan\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"Count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 19848,\n \"min\": 191444,\n \"max\": 242706,\n \"num_unique_values\": 5,\n \"samples\": [\n 214228,\n 191444,\n 209153\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 11
}
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 0
}