anit961/task1/Assignment2_template.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c9eac7d0-4f4c-49e7-a335-b8ca8de34f20",
   "metadata": {},
   "source": [
    "# Assignment 2:\n",
    "In this assignment, you are given a dataset of body fat measurements collected from a clinic. As a nutritionist, you want to use a regression model to check whether your patients are at risk of high body fat.\n",
    "The data contain physical measurements of patients:\n",
    "- explanatory variables: Age (in years), Weight (in pounds), Height (in inches), and BMI (Body Mass Index).\n",
    "- response variable: bodyfat (in %)\n",
    "\n",
    "Complete the tasks below. (Note that each piece of missing code is a single-line command; answers that use more than one line will be marked down.)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f79e7b9-1fff-4bef-9fe8-8e3eb54c68cb",
   "metadata": {
    "tags": []
   },
   "source": [
    "<h1 style=\"color:purple\">Author</h1>\n",
    "\n",
    "  \n",
    "- Name:                      \n",
    "- Student ID: "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4725f61-cb83-4e3c-bc31-1a3992dadd28",
   "metadata": {},
   "source": [
    "## Pre-loaded packages and functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7e90f45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pandas provides data structures such as DataFrame\n",
    "import pandas as pd\n",
    "# statsmodels contains modules for regression and time series analysis\n",
    "import statsmodels.api as sm\n",
    "# numpy is for numerical computing with arrays and matrices\n",
    "import numpy as np\n",
    "# matplotlib and seaborn are plotting packages\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "# show each plot directly below the cell that creates it\n",
    "%matplotlib inline\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "# basic statistics package\n",
    "import scipy.stats as stats\n",
    "from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
    "import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8983cbf-0937-443f-ba43-d18956ad4a3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def outlier(dataframe,model,Type='all'):\n",
    "    # Return the rows whose internally studentized residual is beyond +/-2.\n",
    "    # Type='neg': residual < -2; Type='posi': residual > 2; otherwise |residual| > 2.\n",
    "    A = dataframe.copy()\n",
    "    A = A.dropna()\n",
    "    # relabel the remaining rows 1..n\n",
    "    A.index = range(1,A.shape[0]+1)\n",
    "    studentized_residuals = model.get_influence().resid_studentized_internal\n",
    "    if Type == 'neg':\n",
    "        return A[studentized_residuals<-2]\n",
    "    elif Type == 'posi':\n",
    "        return A[studentized_residuals>2]\n",
    "    else:\n",
    "        return A[np.abs(studentized_residuals)>2]"
   ]
  },
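  {
   "cell_type": "markdown",
   "id": "note-studentized-residuals",
   "metadata": {},
   "source": [
    "As a reading aid for `outlier` above, here is a sketch of the standard definition of the internally studentized residual, which is what `resid_studentized_internal` is understood to return. The symbols $e_i$, $s$ and $h_{ii}$ are notation introduced here, not names from the notebook:\n",
    "\n",
    "$$r_i = \\frac{e_i}{s\\sqrt{1 - h_{ii}}}$$\n",
    "\n",
    "where $e_i$ is the ordinary residual (observed minus fitted), $s$ is the residual standard error and $h_{ii}$ is the leverage of observation $i$. The function keeps rows with $r_i < -2$ for `Type='neg'`, $r_i > 2$ for `Type='posi'`, and $|r_i| > 2$ otherwise."
   ]
  },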
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b47cb1ec-efe7-4124-a653-a9e942ae5e54",
   "metadata": {
    "id": "7NIZKrveWDbd"
   },
   "outputs": [],
   "source": [
    "from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
    "def getvif(X):\n",
    "    # add an intercept column so each VIF is computed for a model with a constant\n",
    "    X = sm.add_constant(X)\n",
    "    vif = pd.DataFrame()\n",
    "    vif[\"VIF\"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]\n",
    "    vif[\"Predictors\"] = X.columns\n",
    "    # drop the row for the constant (row 0) and round to 2 decimals\n",
    "    return vif.drop(index = 0).round(2)"
   ]
  },
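  {
   "cell_type": "markdown",
   "id": "note-vif-definition",
   "metadata": {},
   "source": [
    "For reference, a sketch of the quantity that `getvif` above reports for each predictor, using the standard definition of the variance inflation factor. The symbol $R_j^2$ is notation introduced here:\n",
    "\n",
    "$$\\mathrm{VIF}_j = \\frac{1}{1 - R_j^2}$$\n",
    "\n",
    "where $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on the remaining predictors (with a constant). A value near 1 means predictor $j$ is almost unrelated to the others. A commonly quoted rule of thumb, not a strict cutoff, treats values above roughly 5 to 10 as a warning sign of multicollinearity."
   ]
  },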
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bd86624-4cc8-4790-85d5-826a91ef02aa",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "760c8202-fcaa-4e5f-bbb1-35f972fb4fd6",
   "metadata": {},
   "source": [
    "## Getting the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16326102",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"https://drive.google.com/uc?export=download&id=1r6Za0azxHvJpjUA6lgMJyBqYIWRljmz8\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fd1d118",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9496e740-2f7c-45cc-9409-a029d242a732",
   "metadata": {},
   "source": [
    "## Preliminary study"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52d51f09-1993-47c5-83c8-4d874d0ddaf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# scatter plot matrix of the whole data\n",
    "sns_plot = sns.pairplot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "931e81f6-e2f6-4f07-ab1c-485f598f8a93",
   "metadata": {},
   "source": [
    "### Task 1: Compute the correlation matrix; which variable has the strongest correlation with bodyfat? [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fe53540-95ce-4286-a0b4-0c35437dbe0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "$$$Code_Here$$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d5d3504-6a3c-486f-afa2-639a4f85e015",
   "metadata": {},
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a155616-4e5b-4992-8a04-467c85ae2977",
   "metadata": {},
   "source": [
    "### Answer:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7688f2a9-8c5e-4fae-83f3-8b4b1ef7a799",
   "metadata": {},
   "source": [
    "### Task 2: Randomly split the data into a train set and a test set (use 25 as the random seed/state) [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ae78c9d-deaa-4998-8928-5437c437e285",
   "metadata": {},
   "outputs": [],
   "source": [
    "train = $$$Code_Here$$$\n",
    "test = $$$Code_Here$$$\n",
    "train.shape, test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d82fae0-ce3c-4e4b-92f8-69657b421f74",
   "metadata": {},
   "source": [
    "## Model Building"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5d20382-fd4c-4030-9d43-677f267b447b",
   "metadata": {},
   "source": [
    "### Task 3: Using the train set, fit a simple regression model for bodyfat on the variable from Task 1 (the one with the strongest correlation) [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4994daf5-b40b-41b3-93fd-ab3f90568c91",
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = train[\"bodyfat\"]\n",
    "X = $$$Code_Here$$$\n",
    "SLR = $$$Code_Here$$$\n",
    "print(SLR.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ba21618-7c3a-4baf-861f-cbdbb7cc7f05",
   "metadata": {},
   "source": [
    "### Task 4: Using the train set, fit a multiple regression model for bodyfat on all of the explanatory variables (i.e. age, weight, height and bmi) [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "500ee96e-c8db-4053-ad0b-4097ae291696",
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = train[\"bodyfat\"]\n",
    "X = $$$Code_Here$$$\n",
    "MLR_all = $$$Code_Here$$$\n",
    "print(MLR_all.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af58180f-45a0-4cc9-b042-3845996d683e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# VIF of all predictors\n",
    "getvif(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2cebb64-f6e0-41b6-96be-09f218ba05a3",
   "metadata": {},
   "source": [
    "### Task 5: The output above indicates multicollinearity. Does bmi cause the problem? Briefly explain using your understanding of bmi. [5pts]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87def1eb-ba5c-49a9-bfdf-48b3af5e1a81",
   "metadata": {},
   "source": [
    "### Answer: "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc60e40f-72b8-43a5-8b70-40fec8cc4f4d",
   "metadata": {},
   "source": [
    "### Task 6: Using the train set, fit a multiple regression model for bodyfat on all of the explanatory variables except bmi [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73499ab4-b937-4b30-a647-aeedc20d929e",
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = train[\"bodyfat\"]\n",
    "X = $$$Code_Here$$$\n",
    "MLR = $$$Code_Here$$$\n",
    "print(MLR.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "883671ff-79ce-4c80-941c-957b0c8bcde2",
   "metadata": {},
   "outputs": [],
   "source": [
    "getvif(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "742ef24b-3b03-439d-9d18-47c1626efe05",
   "metadata": {},
   "source": [
    "## Compare the predictive power of SLR and MLR using the test set"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "708e9eed-0b88-49d1-93a1-fba94254b597",
   "metadata": {},
   "source": [
    "### Task 7: Compute the RMSE for MLR [5pts]"
   ]
  },
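  {
   "cell_type": "markdown",
   "id": "note-rmse-definition",
   "metadata": {},
   "source": [
    "As a reminder, a sketch of the root mean squared error that the `np.sqrt(mean_squared_error(...))` pattern in this notebook evaluates. The symbols $y_i$, $\\hat{y}_i$ and $n$ are notation introduced here:\n",
    "\n",
    "$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}\\left(y_i - \\hat{y}_i\\right)^2}$$\n",
    "\n",
    "where $y_i$ is the observed bodyfat of test patient $i$, $\\hat{y}_i$ is the model's prediction for that patient and $n$ is the number of rows in the test set. RMSE is in the same units as bodyfat (%), and a smaller value means better predictions on the test set."
   ]
  },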
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "469672cd-2db3-49eb-b137-c6d63033a55f",
   "metadata": {},
   "outputs": [],
   "source": [
    "Test_X_SLR = test['bmi']\n",
    "Test_X_MLR = $$$Code_Here$$$\n",
    "\n",
    "Test_Y_SLR = SLR.predict(sm.add_constant(Test_X_SLR))\n",
    "Test_Y_MLR = $$$Code_Here$$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f077e2cf-4cad-4578-90bb-a63c32ddda18",
   "metadata": {},
   "outputs": [],
   "source": [
    "Test_Y = test[\"bodyfat\"]\n",
    "\n",
    "from sklearn.metrics import mean_squared_error\n",
    "rmse_SLR = np.sqrt(mean_squared_error(Test_Y, Test_Y_SLR))\n",
    "rmse_MLR = $$$Code_Here$$$\n",
    "print(\"RMSE for test set (SLR): \", rmse_SLR)\n",
    "print(\"RMSE for test set (MLR): \", rmse_MLR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1659a34e-8080-4e92-be34-717acf81c82c",
   "metadata": {},
   "source": [
    "## Final Model and application"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89256526-a134-4e7d-9a84-1f7dc331bfc2",
   "metadata": {},
   "source": [
    "### Task 8: Refit the better model from Task 7 using the full dataset. [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84da494e-36bf-4f96-9376-3fb43e8df704",
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = data[\"bodyfat\"]\n",
    "X = $$$Code_Here$$$\n",
    "Final = $$$Code_Here$$$\n",
    "print(Final.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "539487cb-0e72-4a2b-85cd-6dbac1d2e321",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 9: Predict the body fat for a new patient using the final model from Task 8: age=35, weight=170 pounds, height=72 inches, BMI=23 [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71aacafa-0cf1-40ed-8ee9-d5d09ebf26ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "$$$Code_Here$$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37f640ce-dfd6-42c9-90d5-13b05769448f",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Task 10: Using outlier analysis of the residuals from the final model, report the patients who are in danger (i.e. whose body fat % is much higher than the final model predicts) [5pts]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f06aa2bd-9836-46c2-8e8d-4a876c9ae253",
   "metadata": {},
   "outputs": [],
   "source": [
    "$$$Code_Here$$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "149424ee-f928-479b-8b07-19c3ae88f751",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}