{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Supervised Learning\n",
"## KNN\n",
"> - KNN is a non-parametric learning algorithm (no assumption is made about the underlying data distribution) \n",
"> - KNN can be used for classification (discrete label) and regression (continuous label) \n",
"> - All training data must be kept, because it is needed to determine the label of new data \n",
"> - Sensitive to irrelevant features \n",
"> - Sensitive to scale of data \n",
"### Issues:\n",
"> - Choose the number of neighbors *k* \n",
"> - Choose distance metric "
]
},
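{
"cell_type": "markdown",
"metadata": {},
"source": [
"The points above can be illustrated with a minimal sketch of the KNN decision rule, assuming Euclidean distance and a simple majority vote. The helper name `knn_predict` and the toy points are hypothetical and are not part of the notebook's data; the later cells use scikit-learn's `KNeighborsClassifier` instead.\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"def knn_predict(train_X, train_y, query, k=3):\n",
"    # Hypothetical helper: label `query` by majority vote among its k nearest training points.\n",
"    # No model is fitted; every stored training point is scanned at prediction time.\n",
"    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))\n",
"    votes = Counter(label for _, label in dists[:k])\n",
"    return votes.most_common(1)[0][0]\n",
"\n",
"# Toy 2-D points, made up for illustration\n",
"train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]\n",
"train_y = ['benign', 'benign', 'benign', 'malignant', 'malignant', 'malignant']\n",
"\n",
"print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))\n",
"print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))\n",
"```\n",
"\n",
"Because the whole training set is consulted for each query, prediction cost grows with the amount of stored data."
]
},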
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"### Breast Cancer\n",
"> **Label**: Malignant or Benign \n",
"> **30 Features**: Radius, Texture, Perimeter, Area, Smoothness, etc."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" mean radius mean texture mean perimeter mean area mean smoothness \\\n",
"0 17.99 10.38 122.80 1001.0 0.11840 \n",
"1 20.57 17.77 132.90 1326.0 0.08474 \n",
"2 19.69 21.25 130.00 1203.0 0.10960 \n",
"3 11.42 20.38 77.58 386.1 0.14250 \n",
"4 20.29 14.34 135.10 1297.0 0.10030 \n",
"\n",
" mean compactness mean concavity mean concave points mean symmetry \\\n",
"0 0.27760 0.3001 0.14710 0.2419 \n",
"1 0.07864 0.0869 0.07017 0.1812 \n",
"2 0.15990 0.1974 0.12790 0.2069 \n",
"3 0.28390 0.2414 0.10520 0.2597 \n",
"4 0.13280 0.1980 0.10430 0.1809 \n",
"\n",
" mean fractal dimension ... worst radius worst texture worst perimeter \\\n",
"0 0.07871 ... 25.38 17.33 184.60 \n",
"1 0.05667 ... 24.99 23.41 158.80 \n",
"2 0.05999 ... 23.57 25.53 152.50 \n",
"3 0.09744 ... 14.91 26.50 98.87 \n",
"4 0.05883 ... 22.54 16.67 152.20 \n",
"\n",
" worst area worst smoothness worst compactness worst concavity \\\n",
"0 2019.0 0.1622 0.6656 0.7119 \n",
"1 1956.0 0.1238 0.1866 0.2416 \n",
"2 1709.0 0.1444 0.4245 0.4504 \n",
"3 567.7 0.2098 0.8663 0.6869 \n",
"4 1575.0 0.1374 0.2050 0.4000 \n",
"\n",
" worst concave points worst symmetry worst fractal dimension \n",
"0 0.2654 0.4601 0.11890 \n",
"1 0.1860 0.2750 0.08902 \n",
"2 0.2430 0.3613 0.08758 \n",
"3 0.2575 0.6638 0.17300 \n",
"4 0.1625 0.2364 0.07678 \n",
"\n",
"[5 rows x 30 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from sklearn.datasets import load_breast_cancer\n",
"\n",
"breast_cancer = load_breast_cancer()\n",
"X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)\n",
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Create an X with 2 features only\n",
"\n",
"X = X[['mean area', 'mean compactness']]"
]
},
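{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two retained features live on very different scales (`mean area` is in the hundreds to thousands, `mean compactness` is well below 1), which is where KNN's sensitivity to scale shows up. The sketch below uses only the five rows printed by `X.head()` above and a hypothetical helper `standardize`; it is an illustration, not a step of the original pipeline. In practice a scaler such as scikit-learn's `StandardScaler` would normally be fitted on the training split only.\n",
"\n",
"```python\n",
"import math\n",
"from statistics import mean, pstdev\n",
"\n",
"# First five rows of ('mean area', 'mean compactness') as shown by X.head()\n",
"rows = [(1001.0, 0.27760), (1326.0, 0.07864), (1203.0, 0.15990),\n",
"        (386.1, 0.28390), (1297.0, 0.13280)]\n",
"\n",
"def standardize(data):\n",
"    # Hypothetical helper: rescale each column to zero mean and unit (population) standard deviation.\n",
"    cols = list(zip(*data))\n",
"    mus = [mean(c) for c in cols]\n",
"    sigmas = [pstdev(c) for c in cols]\n",
"    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in data]\n",
"\n",
"# On raw values the Euclidean distance is almost exactly the gap in 'mean area'\n",
"raw_d = math.dist(rows[0], rows[1])\n",
"area_gap = abs(rows[0][0] - rows[1][0])\n",
"print(raw_d, area_gap)\n",
"\n",
"# After standardization both features contribute to the distance\n",
"scaled = standardize(rows)\n",
"print(math.dist(scaled[0], scaled[1]))\n",
"```\n",
"\n",
"Without rescaling, the choice of neighbors is driven almost entirely by `mean area`, so `mean compactness` has little influence on the prediction."
]
},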
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['malignant', 'benign'], dtype='<U9')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"breast_cancer.target_names"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" benign\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
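"# Assumption: the cell defining y is missing from this copy; rebuilt here from the target codes and class names\n",
"y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)\n",
"y = pd.get_dummies(y, drop_first=True, dtype=int)  # single 'benign' column: 1 = benign, 0 = malignant\n",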
"y.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(426, 1)\n",
"(143, 1)\n",
"<class 'pandas.core.frame.DataFrame'>\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)\n",
"print(ytrain.shape)\n",
"print(ytest.shape)\n",
"print(type(ytest))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"import numpy as np\n",
"\n",
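"knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')\n",
"# fit expects a 1-D label array, so flatten the single-column DataFrame\n",
"knn.fit(Xtrain, ytrain.to_numpy().ravel())\n",
"# Reshape to a column vector so it can be stacked next to ytest; -1 infers the row count\n",
"ypred = knn.predict(Xtest).reshape(-1, 1)"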
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Prediction Actual\n",
"0 1 1\n",
"1 1 0\n",
"2 1 1\n",
"3 0 0\n",
"4 0 0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(np.hstack([ypred,ytest]),columns=['Prediction','Actual']).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[42 13]\n",
" [ 9 79]]\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"print(confusion_matrix(ytest, ypred))"
]
}
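,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The confusion matrix above can be turned into summary metrics by hand. The sketch below assumes scikit-learn's convention (rows are actual labels, columns are predicted labels, classes in sorted order) and treats label 1 (benign) as the positive class; the variable names are illustrative.\n",
"\n",
"```python\n",
"# Confusion matrix printed above: rows = actual, columns = predicted\n",
"cm = [[42, 13],\n",
"      [9, 79]]\n",
"\n",
"tn, fp = cm[0]  # actual 0 (malignant): predicted 0, predicted 1\n",
"fn, tp = cm[1]  # actual 1 (benign): predicted 0, predicted 1\n",
"\n",
"accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
"precision = tp / (tp + fp)\n",
"recall = tp / (tp + fn)\n",
"\n",
"print(f'accuracy  = {accuracy:.3f}')\n",
"print(f'precision = {precision:.3f}')\n",
"print(f'recall    = {recall:.3f}')\n",
"```\n",
"\n",
"The four counts sum to 143, the size of the test split. The 13 malignant cases predicted as benign are the costlier kind of error in this setting, so accuracy alone does not tell the whole story."
]
}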
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}