{ "cells": [ { "cell_type": "markdown", "id": "c9eac7d0-4f4c-49e7-a335-b8ca8de34f20", "metadata": {}, "source": [ "# Assignment 2:\n", "In this assignment, you are given a dataset of body fat measurements collected from a clinic. As a nutritionist, you want to use a regression model to check whether your patients are at risk of high body fat.\n", "The data contains physical measurements of patients:\n", "- explanatory variables: Age (in years), Weight (in pounds), Height (in inches), and BMI (Body Mass Index).\n", "- response variable: bodyfat (in %)\n", "\n", "Complete the tasks below. (Note that each missing piece of code is a single-line command; answers that use more than one line will be marked down.)" ] }, { "cell_type": "markdown", "id": "0f79e7b9-1fff-4bef-9fe8-8e3eb54c68cb", "metadata": { "tags": [] }, "source": [ "

## Author

\n", "\n", " \n", "- Name: \n", "- Student ID: " ] }, { "cell_type": "markdown", "id": "d4725f61-cb83-4e3c-bc31-1a3992dadd28", "metadata": {}, "source": [ "## Pre-loaded packages and functions" ] }, { "cell_type": "code", "execution_count": null, "id": "d7e90f45", "metadata": {}, "outputs": [], "source": [ "# pandas provides data structures such as DataFrame\n", "import pandas as pd\n", "# statsmodels contains modules for regression and time series analysis\n", "import statsmodels.api as sm\n", "# numpy is for numerical computing with arrays and matrices\n", "import numpy as np\n", "# Matplotlib, Seaborn: plotting packages\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "# show plots inline, directly below the code cell\n", "%matplotlib inline\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "# basic statistics package\n", "import scipy.stats as stats\n", "from statsmodels.stats.outliers_influence import variance_inflation_factor\n", "import datetime" ] }, { "cell_type": "code", "execution_count": null, "id": "d8983cbf-0937-443f-ba43-d18956ad4a3e", "metadata": {}, "outputs": [], "source": [ "def outlier(dataframe, model, Type='all'):\n", "    # Rows with internally studentized residuals below -2 ('neg'),\n", "    # above 2 ('posi'), or beyond 2 in absolute value (default)\n", "    A = dataframe.copy()\n", "    A = A.dropna()\n", "    A.index = range(1, A.shape[0]+1)\n", "    #A.index = range(0,A.shape[0])\n", "    studentized_residuals = model.get_influence().resid_studentized_internal\n", "    if Type == 'neg':\n", "        return(A[studentized_residuals < -2])\n", "    elif Type == 'posi':\n", "        return(A[studentized_residuals > 2])\n", "    else:\n", "        return(A[np.abs(studentized_residuals) > 2])" ] }, { "cell_type": "code", "execution_count": null, "id": "b47cb1ec-efe7-4124-a653-a9e942ae5e54", "metadata": { "id": "7NIZKrveWDbd" }, "outputs": [], "source": [ "from statsmodels.stats.outliers_influence import variance_inflation_factor\n", "def getvif(X):\n", "    # VIF of each predictor; the added constant column is dropped from the result\n", "    X = sm.add_constant(X)\n", "    vif = pd.DataFrame()\n", "    vif[\"VIF\"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]\n", "    vif[\"Predictors\"] = X.columns\n", "    return(vif.drop(index = 0).round(2))" ] }, { "cell_type": "code", "execution_count": null, "id": "5bd86624-4cc8-4790-85d5-826a91ef02aa", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "760c8202-fcaa-4e5f-bbb1-35f972fb4fd6", "metadata": {}, "source": [ "## Getting the data" ] }, { "cell_type": "code", "execution_count": null, "id": "16326102", "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"https://drive.google.com/uc?export=download&id=1r6Za0azxHvJpjUA6lgMJyBqYIWRljmz8\")\n", "data.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "7fd1d118", "metadata": {}, "outputs": [], "source": [ "data.shape" ] }, { "cell_type": "markdown", "id": "9496e740-2f7c-45cc-9409-a029d242a732", "metadata": {}, "source": [ "## Preliminary study" ] }, { "cell_type": "code", "execution_count": null, "id": "52d51f09-1993-47c5-83c8-4d874d0ddaf1", "metadata": {}, "outputs": [], "source": [ "# scatter plot matrix of the whole data\n", "sns_plot = sns.pairplot(data)" ] }, { "cell_type": "markdown", "id": "931e81f6-e2f6-4f07-ab1c-485f598f8a93", "metadata": {}, "source": [ "### Task 1: Compute the correlation matrix; which variable has the strongest correlation with bodyfat? 
[5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "6fe53540-95ce-4286-a0b4-0c35437dbe0c", "metadata": {}, "outputs": [], "source": [ "$$$Code_Here$$$" ] }, { "cell_type": "markdown", "id": "7d5d3504-6a3c-486f-afa2-639a4f85e015", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "5a155616-4e5b-4992-8a04-467c85ae2977", "metadata": {}, "source": [ "### Answer:" ] }, { "cell_type": "markdown", "id": "7688f2a9-8c5e-4fae-83f3-8b4b1ef7a799", "metadata": {}, "source": [ "### Task 2: Randomly split the data into train and test sets (use 25 as the random seed/state) [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "4ae78c9d-deaa-4998-8928-5437c437e285", "metadata": {}, "outputs": [], "source": [ "train = $$$Code_Here$$$\n", "test = $$$Code_Here$$$\n", "train.shape, test.shape" ] }, { "cell_type": "markdown", "id": "1d82fae0-ce3c-4e4b-92f8-69657b421f74", "metadata": {}, "source": [ "## Model Building" ] }, { "cell_type": "markdown", "id": "e5d20382-fd4c-4030-9d43-677f267b447b", "metadata": {}, "source": [ "### Task 3: Using the train set, fit a simple regression model for bodyfat using the variable from Task 1 (the one with the strongest correlation) [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "4994daf5-b40b-41b3-93fd-ab3f90568c91", "metadata": {}, "outputs": [], "source": [ "Y = train[\"bodyfat\"]\n", "X = $$$Code_Here$$$\n", "SLR = $$$Code_Here$$$\n", "print(SLR.summary())" ] }, { "cell_type": "markdown", "id": "8ba21618-7c3a-4baf-861f-cbdbb7cc7f05", "metadata": {}, "source": [ "### Task 4: Using the train set, fit a multiple regression model for bodyfat using all of the explanatory variables (i.e. 
age, weight, height and bmi) [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "500ee96e-c8db-4053-ad0b-4097ae291696", "metadata": {}, "outputs": [], "source": [ "Y = train[\"bodyfat\"]\n", "X = $$$Code_Here$$$\n", "MLR_all = $$$Code_Here$$$\n", "print(MLR_all.summary())" ] }, { "cell_type": "code", "execution_count": null, "id": "af58180f-45a0-4cc9-b042-3845996d683e", "metadata": {}, "outputs": [], "source": [ "# VIF of all predictors\n", "getvif(X)" ] }, { "cell_type": "markdown", "id": "c2cebb64-f6e0-41b6-96be-09f218ba05a3", "metadata": {}, "source": [ "### Task 5: The VIF values above indicate multicollinearity. Does bmi cause the problem? Briefly explain using your understanding of bmi. [5pts]" ] }, { "cell_type": "markdown", "id": "87def1eb-ba5c-49a9-bfdf-48b3af5e1a81", "metadata": {}, "source": [ "### Answer: " ] }, { "cell_type": "markdown", "id": "dc60e40f-72b8-43a5-8b70-40fec8cc4f4d", "metadata": {}, "source": [ "### Task 6: Using the train set, fit a multiple regression model for bodyfat using all of the explanatory variables except bmi [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "73499ab4-b937-4b30-a647-aeedc20d929e", "metadata": {}, "outputs": [], "source": [ "Y = train[\"bodyfat\"]\n", "X = $$$Code_Here$$$\n", "MLR = $$$Code_Here$$$\n", "print(MLR.summary())" ] }, { "cell_type": "code", "execution_count": null, "id": "883671ff-79ce-4c80-941c-957b0c8bcde2", "metadata": {}, "outputs": [], "source": [ "getvif(X)" ] }, { "cell_type": "markdown", "id": "742ef24b-3b03-439d-9d18-47c1626efe05", "metadata": {}, "source": [ "## Compare the predictive power between SLR and MLR using the test set. 
" ] }, { "cell_type": "markdown", "id": "708e9eed-0b88-49d1-93a1-fba94254b597", "metadata": {}, "source": [ "### Task 7: Compute the RMSE for MLR [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "469672cd-2db3-49eb-b137-c6d63033a55f", "metadata": {}, "outputs": [], "source": [ "Test_X_SLR = test['bmi']\n", "Test_X_MLR = $$$Code_Here$$$\n", "\n", "Test_Y_SLR = SLR.predict(sm.add_constant(Test_X_SLR))\n", "Test_Y_MLR = $$$Code_Here$$$" ] }, { "cell_type": "code", "execution_count": null, "id": "f077e2cf-4cad-4578-90bb-a63c32ddda18", "metadata": {}, "outputs": [], "source": [ "Test_Y = test[\"bodyfat\"]\n", "\n", "from sklearn.metrics import mean_squared_error\n", "rmse_SLR = np.sqrt(mean_squared_error(Test_Y, Test_Y_SLR))\n", "rmse_MLR = $$$Code_Here$$$\n", "print(\"RMSE for test set (SLR): \", rmse_SLR)\n", "print(\"RMSE for test set (MLR): \", rmse_MLR)" ] }, { "cell_type": "markdown", "id": "1659a34e-8080-4e92-be34-717acf81c82c", "metadata": {}, "source": [ "## Final Model and application" ] }, { "cell_type": "markdown", "id": "89256526-a134-4e7d-9a84-1f7dc331bfc2", "metadata": {}, "source": [ "### Task 8: Refit the better model from Task7 using the full dataset. [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "84da494e-36bf-4f96-9376-3fb43e8df704", "metadata": {}, "outputs": [], "source": [ "Y = data[\"bodyfat\"]\n", "X = $$$Code_Here$$$\n", "Final = $$$Code_Here$$$\n", "print(Final.summary())" ] }, { "cell_type": "markdown", "id": "539487cb-0e72-4a2b-85cd-6dbac1d2e321", "metadata": { "tags": [] }, "source": [ "### Task 9: Predict the body fat for a new patient using the final model from Task 8: age=35, weight=170 pounds. 
Height=72 inches, BMI=23 [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "71aacafa-0cf1-40ed-8ee9-d5d09ebf26ab", "metadata": {}, "outputs": [], "source": [ "$$$Code_Here$$$" ] }, { "cell_type": "markdown", "id": "37f640ce-dfd6-42c9-90d5-13b05769448f", "metadata": { "tags": [] }, "source": [ "### Task 10: Using outlier analysis of the residuals from the final model, report the patients who are in danger (i.e. whose % of fat is much higher than the final model predicts) [5pts]" ] }, { "cell_type": "code", "execution_count": null, "id": "f06aa2bd-9836-46c2-8e8d-4a876c9ae253", "metadata": {}, "outputs": [], "source": [ "$$$Code_Here$$$" ] }, { "cell_type": "code", "execution_count": null, "id": "149424ee-f928-479b-8b07-19c3ae88f751", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 5 }