{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning\n", "## KNN\n", "> - KNN is a non-parametric learning algorithm (No assumption is made on the data) \n", "> - KNN can be used for classification (discrte) and regression (continuous label) \n", "> - All training data has to be present to determine the label of new data \n", "> - Sensitive to irrelavant features \n", "> - Sensitive to scale of data \n", "### Issues:\n", "> - Choose number of neighors *k* \n", "> - Choose distance metric " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "### Breast Cancer\n", "> **Label**: Malignant or Benign \n", "> **30 Features**: Radius, Texture, Perimeter, Area, Smoothness, etc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean radiusmean texturemean perimetermean areamean smoothnessmean compactnessmean concavitymean concave pointsmean symmetrymean fractal dimension...worst radiusworst textureworst perimeterworst areaworst smoothnessworst compactnessworst concavityworst concave pointsworst symmetryworst fractal dimension
017.9910.38122.801001.00.118400.277600.30010.147100.24190.07871...25.3817.33184.602019.00.16220.66560.71190.26540.46010.11890
120.5717.77132.901326.00.084740.078640.08690.070170.18120.05667...24.9923.41158.801956.00.12380.18660.24160.18600.27500.08902
219.6921.25130.001203.00.109600.159900.19740.127900.20690.05999...23.5725.53152.501709.00.14440.42450.45040.24300.36130.08758
311.4220.3877.58386.10.142500.283900.24140.105200.25970.09744...14.9126.5098.87567.70.20980.86630.68690.25750.66380.17300
420.2914.34135.101297.00.100300.132800.19800.104300.18090.05883...22.5416.67152.201575.00.13740.20500.40000.16250.23640.07678
\n", "

5 rows × 30 columns

\n", "
" ], "text/plain": [ " mean radius mean texture mean perimeter mean area mean smoothness \\\n", "0 17.99 10.38 122.80 1001.0 0.11840 \n", "1 20.57 17.77 132.90 1326.0 0.08474 \n", "2 19.69 21.25 130.00 1203.0 0.10960 \n", "3 11.42 20.38 77.58 386.1 0.14250 \n", "4 20.29 14.34 135.10 1297.0 0.10030 \n", "\n", " mean compactness mean concavity mean concave points mean symmetry \\\n", "0 0.27760 0.3001 0.14710 0.2419 \n", "1 0.07864 0.0869 0.07017 0.1812 \n", "2 0.15990 0.1974 0.12790 0.2069 \n", "3 0.28390 0.2414 0.10520 0.2597 \n", "4 0.13280 0.1980 0.10430 0.1809 \n", "\n", " mean fractal dimension ... worst radius worst texture worst perimeter \\\n", "0 0.07871 ... 25.38 17.33 184.60 \n", "1 0.05667 ... 24.99 23.41 158.80 \n", "2 0.05999 ... 23.57 25.53 152.50 \n", "3 0.09744 ... 14.91 26.50 98.87 \n", "4 0.05883 ... 22.54 16.67 152.20 \n", "\n", " worst area worst smoothness worst compactness worst concavity \\\n", "0 2019.0 0.1622 0.6656 0.7119 \n", "1 1956.0 0.1238 0.1866 0.2416 \n", "2 1709.0 0.1444 0.4245 0.4504 \n", "3 567.7 0.2098 0.8663 0.6869 \n", "4 1575.0 0.1374 0.2050 0.4000 \n", "\n", " worst concave points worst symmetry worst fractal dimension \n", "0 0.2654 0.4601 0.11890 \n", "1 0.1860 0.2750 0.08902 \n", "2 0.2430 0.3613 0.08758 \n", "3 0.2575 0.6638 0.17300 \n", "4 0.1625 0.2364 0.07678 \n", "\n", "[5 rows x 30 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_breast_cancer\n", "\n", "breast_cancer = load_breast_cancer()\n", "X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)\n", "X.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Create an X with 2 features only\n", "\n", "X = X[['mean area', 'mean compactness']]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['malignant', 'benign'], dtype='\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
benign
00
10
20
30
40
\n", "" ], "text/plain": [ " benign\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = pd.get_dummies(y, drop_first=True) \n", "y.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(426, 1)\n", "(143, 1)\n", "\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)\n", "print(ytrain.shape)\n", "print(ytest.shape)\n", "print(type(ytest))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "import numpy as np\n", "\n", "knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')\n", "knn.fit(Xtrain, ytrain.to_numpy().ravel())\n", "ypred = knn.predict(Xtest).reshape([143,1])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PredictionActual
011
110
211
300
400
\n", "
" ], "text/plain": [ " Prediction Actual\n", "0 1 1\n", "1 1 0\n", "2 1 1\n", "3 0 0\n", "4 0 0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.hstack([ypred,ytest]),columns=['Prediction','Actual']).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[42 13]\n", " [ 9 79]]\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "print(confusion_matrix(ytest, ypred))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 2 }