{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c9eac7d0-4f4c-49e7-a335-b8ca8de34f20",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Assignment 2:\n",
|
|
"In this assignment, you are given a dataset about bodyfat measure collected from a clinic. As a nutritionist, you want to use regression model to check whether your patients are in danger of high bodyfat.\n",
|
|
"The data contains physical measurements of patients:\n",
|
|
"- explantory variables: Age (in year), Weight (in pounds), Height (in inches), and BMI (Body Mass Index).\n",
|
|
"- response variable: bodyfat (in %)\n",
|
|
"\n",
|
|
"Complete the given tasks below. (Notice each of the missing code to be filled is a single line command, more than one command line will be downgraded)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0f79e7b9-1fff-4bef-9fe8-8e3eb54c68cb",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"<h1 style=\"color:purple\">Author</h1>\n",
|
|
"\n",
|
|
" \n",
|
|
"- Name: \n",
|
|
"- Student ID: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d4725f61-cb83-4e3c-bc31-1a3992dadd28",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Pre-loaded packages and functions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d7e90f45",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#### Pandas is for using data structures\n",
|
|
"import pandas as pd\n",
|
|
"# statsmodels contain modules for regression and time series analysis\n",
|
|
"import statsmodels.api as sm\n",
|
|
"# numpy is for numerical computing of array and matrix\n",
|
|
"import numpy as np\n",
|
|
"# Matplotlib, Seaborn: plotting package\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import seaborn as sns \n",
|
|
"# matplotlib Showing the plot right after the current code \n",
|
|
"%matplotlib inline\n",
|
|
"import warnings\n",
|
|
"warnings.filterwarnings('ignore')\n",
|
|
"# basic statistics package\n",
|
|
"import scipy.stats as stats\n",
|
|
"from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
|
|
"import datetime"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d8983cbf-0937-443f-ba43-d18956ad4a3e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def outlier(dataframe,model,Type='all'):\n",
|
|
" A = dataframe.copy()\n",
|
|
" A = A.dropna()\n",
|
|
" A.index = range(1,A.shape[0]+1)\n",
|
|
" #A.index = range(0,A.shape[0])\n",
|
|
" studentized_residuals = model.get_influence().resid_studentized_internal\n",
|
|
" if Type == 'neg':\n",
|
|
" return(A[studentized_residuals<-2])\n",
|
|
" elif Type == 'posi':\n",
|
|
" return(A[studentized_residuals>2])\n",
|
|
" else:\n",
|
|
" return(A[np.abs(studentized_residuals)>2])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b47cb1ec-efe7-4124-a653-a9e942ae5e54",
|
|
"metadata": {
|
|
"id": "7NIZKrveWDbd"
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
|
|
"def getvif(X):\n",
|
|
" X = sm.add_constant(X)\n",
|
|
" vif = pd.DataFrame()\n",
|
|
" vif[\"VIF\"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]\n",
|
|
" vif[\"Predictors\"] = X.columns\n",
|
|
" return(vif.drop(index = 0).round(2)) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "5bd86624-4cc8-4790-85d5-826a91ef02aa",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "760c8202-fcaa-4e5f-bbb1-35f972fb4fd6",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Getting the data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "16326102",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data = pd.read_csv(\"https://drive.google.com/uc?export=download&id=1r6Za0azxHvJpjUA6lgMJyBqYIWRljmz8\")\n",
|
|
"data.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "7fd1d118",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9496e740-2f7c-45cc-9409-a029d242a732",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Preliminary study"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "52d51f09-1993-47c5-83c8-4d874d0ddaf1",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# scatter plot matrix of the whole data\n",
|
|
"sns_plot = sns.pairplot(data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "931e81f6-e2f6-4f07-ab1c-485f598f8a93",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 1: Compute the correlation matrix; which variable has the strongest correlation with bodyfat? [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6fe53540-95ce-4286-a0b4-0c35437dbe0c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"$$$Code_Here$$$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7d5d3504-6a3c-486f-afa2-639a4f85e015",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5a155616-4e5b-4992-8a04-467c85ae2977",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Answer:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7688f2a9-8c5e-4fae-83f3-8b4b1ef7a799",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 2: Split the data into train and test set by random (use 25 as the random seed/state) [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "4ae78c9d-deaa-4998-8928-5437c437e285",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"train = $$$Code_Here$$$\n",
|
|
"test = $$$Code_Here$$$\n",
|
|
"train.shape, test.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1d82fae0-ce3c-4e4b-92f8-69657b421f74",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Model Building"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e5d20382-fd4c-4030-9d43-677f267b447b",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 3: Using the train set, fit a simple regression model for bodyfat by the variable from Task1 (with max correlation) [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "4994daf5-b40b-41b3-93fd-ab3f90568c91",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"Y = train[\"bodyfat\"]\n",
|
|
"X = $$$Code_Here$$$\n",
|
|
"SLR = $$$Code_Here$$$\n",
|
|
"print(SLR.summary())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8ba21618-7c3a-4baf-861f-cbdbb7cc7f05",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 4: Using the trian set, fit a multiple regression model for bodyfat by all of the explanatory variables (i.e. age, weight, height and bmi) [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "500ee96e-c8db-4053-ad0b-4097ae291696",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"Y = train[\"bodyfat\"]\n",
|
|
"X = $$$Code_Here$$$\n",
|
|
"MLR_all = $$$Code_Here$$$\n",
|
|
"print(MLR_all.summary())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "af58180f-45a0-4cc9-b042-3845996d683e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# VIF of all predictors\n",
|
|
"getvif(X)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c2cebb64-f6e0-41b6-96be-09f218ba05a3",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 5: Multicollinearity is reflected from above. Does bmi cause the problem? Briefly explain using your understanding about bmi.[5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "87def1eb-ba5c-49a9-bfdf-48b3af5e1a81",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Answer: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dc60e40f-72b8-43a5-8b70-40fec8cc4f4d",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 6: Using the train set, fit a multiple regression model for bodyfat by all of the explanatory variables except bmi [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "73499ab4-b937-4b30-a647-aeedc20d929e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"Y = train[\"bodyfat\"]\n",
|
|
"X = $$$Code_Here$$$\n",
|
|
"MLR = $$$Code_Here$$$\n",
|
|
"print(MLR.summary())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "883671ff-79ce-4c80-941c-957b0c8bcde2",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"getvif(X)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "742ef24b-3b03-439d-9d18-47c1626efe05",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Compare the predictive power between SLR and MLR using the test set. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "708e9eed-0b88-49d1-93a1-fba94254b597",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 7: Compute the RMSE for MLR [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "469672cd-2db3-49eb-b137-c6d63033a55f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"Test_X_SLR = test['bmi']\n",
|
|
"Test_X_MLR = $$$Code_Here$$$\n",
|
|
"\n",
|
|
"Test_Y_SLR = SLR.predict(sm.add_constant(Test_X_SLR))\n",
|
|
"Test_Y_MLR = $$$Code_Here$$$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "f077e2cf-4cad-4578-90bb-a63c32ddda18",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"Test_Y = test[\"bodyfat\"]\n",
|
|
"\n",
|
|
"from sklearn.metrics import mean_squared_error\n",
|
|
"rmse_SLR = np.sqrt(mean_squared_error(Test_Y, Test_Y_SLR))\n",
|
|
"rmse_MLR = $$$Code_Here$$$\n",
|
|
"print(\"RMSE for test set (SLR): \", rmse_SLR)\n",
|
|
"print(\"RMSE for test set (MLR): \", rmse_MLR)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1659a34e-8080-4e92-be34-717acf81c82c",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Final Model and application"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "89256526-a134-4e7d-9a84-1f7dc331bfc2",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Task 8: Refit the better model from Task7 using the full dataset. [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "84da494e-36bf-4f96-9376-3fb43e8df704",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"Y = data[\"bodyfat\"]\n",
|
|
"X = $$$Code_Here$$$\n",
|
|
"Final = $$$Code_Here$$$\n",
|
|
"print(Final.summary())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "539487cb-0e72-4a2b-85cd-6dbac1d2e321",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"### Task 9: Predict the body fat for a new patient using the final model from Task 8: age=35, weight=170 pounds. Height=72 inches, BMI=23 [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "71aacafa-0cf1-40ed-8ee9-d5d09ebf26ab",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"$$$Code_Here$$$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "37f640ce-dfd6-42c9-90d5-13b05769448f",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"### Task 10: by using residuals (from final model) outlier analysis, report those patients are in danger (i.e. % of fat is much higher than what we expected from the final model) [5pts]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "f06aa2bd-9836-46c2-8e8d-4a876c9ae253",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"$$$Code_Here$$$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "149424ee-f928-479b-8b07-19c3ae88f751",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|