anit961/task1/Assignment2_template.ipynb
louiscklaw d71bfac9e2 update,
2025-02-01 01:59:53 +08:00


{
"cells": [
{
"cell_type": "markdown",
"id": "c9eac7d0-4f4c-49e7-a335-b8ca8de34f20",
"metadata": {},
"source": [
"# Assignment 2:\n",
"In this assignment, you are given a dataset of body fat measurements collected from a clinic. As a nutritionist, you want to use regression models to check whether your patients are in danger of high body fat.\n",
"The data contains physical measurements of patients:\n",
"- explanatory variables: Age (in years), Weight (in pounds), Height (in inches), and BMI (Body Mass Index).\n",
"- response variable: bodyfat (in %)\n",
"\n",
"Complete the given tasks below. (Note that each piece of missing code to be filled in is a single-line command; answers that use more than one line will be marked down.)"
]
},
{
"cell_type": "markdown",
"id": "0f79e7b9-1fff-4bef-9fe8-8e3eb54c68cb",
"metadata": {
"tags": []
},
"source": [
"<h1 style=\"color:purple\">Author</h1>\n",
"\n",
" \n",
"- Name: \n",
"- Student ID: "
]
},
{
"cell_type": "markdown",
"id": "d4725f61-cb83-4e3c-bc31-1a3992dadd28",
"metadata": {},
"source": [
"## Pre-loaded packages and functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7e90f45",
"metadata": {},
"outputs": [],
"source": [
"# pandas provides the DataFrame and Series data structures\n",
"import pandas as pd\n",
"# statsmodels contains modules for regression and time series analysis\n",
"import statsmodels.api as sm\n",
"# numpy is for numerical computing with arrays and matrices\n",
"import numpy as np\n",
"# matplotlib and seaborn are plotting packages\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"# show each plot inline, right below the cell that creates it\n",
"%matplotlib inline\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"# basic statistics package\n",
"import scipy.stats as stats\n",
"from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
"import datetime"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8983cbf-0937-443f-ba43-d18956ad4a3e",
"metadata": {},
"outputs": [],
"source": [
"def outlier(dataframe, model, Type='all'):\n",
"    # Assumes `model` was fitted on the non-missing rows of `dataframe`, in the same order\n",
"    A = dataframe.copy()\n",
"    A = A.dropna()\n",
"    # Relabel the rows 1..n\n",
"    A.index = range(1, A.shape[0] + 1)\n",
"    #A.index = range(0,A.shape[0])\n",
"    studentized_residuals = model.get_influence().resid_studentized_internal\n",
"    if Type == 'neg':\n",
"        # observed value much lower than predicted\n",
"        return A[studentized_residuals < -2]\n",
"    elif Type == 'posi':\n",
"        # observed value much higher than predicted\n",
"        return A[studentized_residuals > 2]\n",
"    else:\n",
"        # both tails\n",
"        return A[np.abs(studentized_residuals) > 2]"
]
},
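{
"cell_type": "markdown",
"id": "studentized-residual-note",
"metadata": {},
"source": [
"The `outlier` helper above filters on internally studentized residuals. As a sketch, assuming the usual OLS definition behind `resid_studentized_internal`, the value for observation $i$ is:\n",
"\n",
"$$r_i = \\frac{e_i}{s\\sqrt{1 - h_{ii}}}, \\qquad e_i = y_i - \\hat{y}_i$$\n",
"\n",
"where $s$ is the residual standard error of the fitted model and $h_{ii}$ is the leverage of observation $i$ (the $i$-th diagonal entry of the hat matrix). These symbols are notation introduced here, not names from the code. The helper keeps rows with $r_i < -2$ (`Type='neg'`), $r_i > 2$ (`Type='posi'`), or $|r_i| > 2$ (any other `Type`)."
]
},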
{
"cell_type": "code",
"execution_count": null,
"id": "b47cb1ec-efe7-4124-a653-a9e942ae5e54",
"metadata": {
"id": "7NIZKrveWDbd"
},
"outputs": [],
"source": [
"from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
"def getvif(X):\n",
"    # Add the intercept column so each VIF is computed with a constant in the model\n",
"    X = sm.add_constant(X)\n",
"    vif = pd.DataFrame()\n",
"    vif[\"VIF\"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]\n",
"    vif[\"Predictors\"] = X.columns\n",
"    # Drop row 0 (the constant) and round for display\n",
"    return vif.drop(index=0).round(2)"
]
},
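{
"cell_type": "markdown",
"id": "vif-definition-note",
"metadata": {},
"source": [
"For reference, a sketch of the quantity that `getvif` reports, assuming the standard definition used by statsmodels' `variance_inflation_factor`. For predictor $j$:\n",
"\n",
"$$\\mathrm{VIF}_j = \\frac{1}{1 - R_j^2}$$\n",
"\n",
"where $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all the other columns of `X` (including the constant). The symbols $j$ and $R_j^2$ are notation introduced here, not names from the code. A value near 1 means the predictor is nearly uncorrelated with the others; larger values mean the variance of its coefficient is inflated by collinearity."
]
},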
{
"cell_type": "code",
"execution_count": null,
"id": "5bd86624-4cc8-4790-85d5-826a91ef02aa",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "760c8202-fcaa-4e5f-bbb1-35f972fb4fd6",
"metadata": {},
"source": [
"## Getting the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16326102",
"metadata": {},
"outputs": [],
"source": [
"data = pd.read_csv(\"https://drive.google.com/uc?export=download&id=1r6Za0azxHvJpjUA6lgMJyBqYIWRljmz8\")\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fd1d118",
"metadata": {},
"outputs": [],
"source": [
"data.shape"
]
},
{
"cell_type": "markdown",
"id": "9496e740-2f7c-45cc-9409-a029d242a732",
"metadata": {},
"source": [
"## Preliminary study"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52d51f09-1993-47c5-83c8-4d874d0ddaf1",
"metadata": {},
"outputs": [],
"source": [
"# scatter plot matrix of the whole data\n",
"sns_plot = sns.pairplot(data)"
]
},
{
"cell_type": "markdown",
"id": "931e81f6-e2f6-4f07-ab1c-485f598f8a93",
"metadata": {},
"source": [
"### Task 1: Compute the correlation matrix; which variable has the strongest correlation with bodyfat? [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fe53540-95ce-4286-a0b4-0c35437dbe0c",
"metadata": {},
"outputs": [],
"source": [
"$$$Code_Here$$$"
]
},
{
"cell_type": "markdown",
"id": "7d5d3504-6a3c-486f-afa2-639a4f85e015",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"id": "5a155616-4e5b-4992-8a04-467c85ae2977",
"metadata": {},
"source": [
"### Answer:"
]
},
{
"cell_type": "markdown",
"id": "7688f2a9-8c5e-4fae-83f3-8b4b1ef7a799",
"metadata": {},
"source": [
"### Task 2: Randomly split the data into train and test sets (use 25 as the random seed/state) [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ae78c9d-deaa-4998-8928-5437c437e285",
"metadata": {},
"outputs": [],
"source": [
"train = $$$Code_Here$$$\n",
"test = $$$Code_Here$$$\n",
"train.shape, test.shape"
]
},
{
"cell_type": "markdown",
"id": "1d82fae0-ce3c-4e4b-92f8-69657b421f74",
"metadata": {},
"source": [
"## Model Building"
]
},
{
"cell_type": "markdown",
"id": "e5d20382-fd4c-4030-9d43-677f267b447b",
"metadata": {},
"source": [
"### Task 3: Using the train set, fit a simple regression model for bodyfat on the variable from Task 1 (the one with the strongest correlation) [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4994daf5-b40b-41b3-93fd-ab3f90568c91",
"metadata": {},
"outputs": [],
"source": [
"Y = train[\"bodyfat\"]\n",
"X = $$$Code_Here$$$\n",
"SLR = $$$Code_Here$$$\n",
"print(SLR.summary())"
]
},
{
"cell_type": "markdown",
"id": "8ba21618-7c3a-4baf-861f-cbdbb7cc7f05",
"metadata": {},
"source": [
"### Task 4: Using the train set, fit a multiple regression model for bodyfat on all of the explanatory variables (i.e. age, weight, height and bmi) [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "500ee96e-c8db-4053-ad0b-4097ae291696",
"metadata": {},
"outputs": [],
"source": [
"Y = train[\"bodyfat\"]\n",
"X = $$$Code_Here$$$\n",
"MLR_all = $$$Code_Here$$$\n",
"print(MLR_all.summary())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af58180f-45a0-4cc9-b042-3845996d683e",
"metadata": {},
"outputs": [],
"source": [
"# VIF of all predictors\n",
"getvif(X)"
]
},
{
"cell_type": "markdown",
"id": "c2cebb64-f6e0-41b6-96be-09f218ba05a3",
"metadata": {},
"source": [
"### Task 5: The VIF values above indicate multicollinearity. Does bmi cause the problem? Briefly explain using your understanding of bmi. [5pts]"
]
},
{
"cell_type": "markdown",
"id": "87def1eb-ba5c-49a9-bfdf-48b3af5e1a81",
"metadata": {},
"source": [
"### Answer: "
]
},
{
"cell_type": "markdown",
"id": "dc60e40f-72b8-43a5-8b70-40fec8cc4f4d",
"metadata": {},
"source": [
"### Task 6: Using the train set, fit a multiple regression model for bodyfat by all of the explanatory variables except bmi [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73499ab4-b937-4b30-a647-aeedc20d929e",
"metadata": {},
"outputs": [],
"source": [
"Y = train[\"bodyfat\"]\n",
"X = $$$Code_Here$$$\n",
"MLR = $$$Code_Here$$$\n",
"print(MLR.summary())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "883671ff-79ce-4c80-941c-957b0c8bcde2",
"metadata": {},
"outputs": [],
"source": [
"getvif(X)"
]
},
{
"cell_type": "markdown",
"id": "742ef24b-3b03-439d-9d18-47c1626efe05",
"metadata": {},
"source": [
"## Compare the predictive power between SLR and MLR using the test set. "
]
},
{
"cell_type": "markdown",
"id": "708e9eed-0b88-49d1-93a1-fba94254b597",
"metadata": {},
"source": [
"### Task 7: Compute the RMSE for MLR [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "469672cd-2db3-49eb-b137-c6d63033a55f",
"metadata": {},
"outputs": [],
"source": [
"Test_X_SLR = test['bmi']\n",
"Test_X_MLR = $$$Code_Here$$$\n",
"\n",
"Test_Y_SLR = SLR.predict(sm.add_constant(Test_X_SLR))\n",
"Test_Y_MLR = $$$Code_Here$$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f077e2cf-4cad-4578-90bb-a63c32ddda18",
"metadata": {},
"outputs": [],
"source": [
"Test_Y = test[\"bodyfat\"]\n",
"\n",
"from sklearn.metrics import mean_squared_error\n",
"rmse_SLR = np.sqrt(mean_squared_error(Test_Y, Test_Y_SLR))\n",
"rmse_MLR = $$$Code_Here$$$\n",
"print(\"RMSE for test set (SLR): \", rmse_SLR)\n",
"print(\"RMSE for test set (MLR): \", rmse_MLR)"
]
},
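{
"cell_type": "markdown",
"id": "rmse-definition-note",
"metadata": {},
"source": [
"For reference, the quantity computed by `np.sqrt(mean_squared_error(Test_Y, Test_Y_SLR))` in the cell above can be written as follows (a sketch; $n$, $y_i$ and $\\hat{y}_i$ are notation introduced here):\n",
"\n",
"$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}$$\n",
"\n",
"where $n$ is the number of rows in the test set, $y_i$ is the observed bodyfat, and $\\hat{y}_i$ is the model's prediction for the same row. RMSE is in the same units as bodyfat (%), and a lower value on the test set indicates better predictive accuracy."
]
},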
{
"cell_type": "markdown",
"id": "1659a34e-8080-4e92-be34-717acf81c82c",
"metadata": {},
"source": [
"## Final Model and application"
]
},
{
"cell_type": "markdown",
"id": "89256526-a134-4e7d-9a84-1f7dc331bfc2",
"metadata": {},
"source": [
"### Task 8: Refit the better model from Task 7 using the full dataset. [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84da494e-36bf-4f96-9376-3fb43e8df704",
"metadata": {},
"outputs": [],
"source": [
"Y = data[\"bodyfat\"]\n",
"X = $$$Code_Here$$$\n",
"Final = $$$Code_Here$$$\n",
"print(Final.summary())"
]
},
{
"cell_type": "markdown",
"id": "539487cb-0e72-4a2b-85cd-6dbac1d2e321",
"metadata": {
"tags": []
},
"source": [
"### Task 9: Predict the body fat for a new patient using the final model from Task 8: age=35, weight=170 pounds, height=72 inches, BMI=23 [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71aacafa-0cf1-40ed-8ee9-d5d09ebf26ab",
"metadata": {},
"outputs": [],
"source": [
"$$$Code_Here$$$"
]
},
{
"cell_type": "markdown",
"id": "37f640ce-dfd6-42c9-90d5-13b05769448f",
"metadata": {
"tags": []
},
"source": [
"### Task 10: Using residual outlier analysis on the final model, report the patients who are in danger (i.e. whose % of fat is much higher than the final model predicts) [5pts]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f06aa2bd-9836-46c2-8e8d-4a876c9ae253",
"metadata": {},
"outputs": [],
"source": [
"$$$Code_Here$$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "149424ee-f928-479b-8b07-19c3ae88f751",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}