{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 2: Probability theory and probability density estimation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this lab you will: (i) learn how to generate data from a normal (Gaussian) distribution, and from a multivariate normal distribution; (ii) play around with histograms to get a feel for how they relate to the underlying probability density function in different scenarios; (iii) use the Scikit Learn library to implement and play around with kernel density estimation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Start by importing the libraries we'll use:\n",
"\n",
"- *numpy*, which is the standard numerical Python package.\n",
"\n",
"- *scipy.stats*, which is useful for doing statistics\n",
"\n",
"\n",
"- *matplotlib.pyplot*, which is a general purpose data visualisation library\n",
"\n",
" https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.html\n",
" \n",
"\n",
"- *KernelDensity* from *sklearn.neighbours*. This is part of *scikit-learn*, which is a standard go-to library for obtaining machine learning resources: algorithms, classifiers, metrics, datasets etc.\n",
"\n",
" https://scikit-learn.org/stable/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import scipy.stats as stats\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.neighbors import KernelDensity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Random number generation and histograms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is how to generate data that are normally distributed, and plot with a basic histogram. Have a play around with the parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate 1D data from normal distribution\n",
"nbpts = 50 # Number of points to generate\n",
"mu = 1 # the mean\n",
"sigma = 2 # the standard deviation\n",
"\n",
"data = np.random.normal(mu,sigma,(nbpts,1)) # makes a column vector\n",
"\n",
"# Create histogram. By default, this will show how many points fall in each bin. \n",
"bins = 10 # This specifies the number of bins, Python decides the interval. This can be OK but means you do not specify exactly where the bins are. \n",
"plt.hist(data, bins) \n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A different specification\n",
"edges = np.linspace(-5,8,14) # This specifies the number of bins and also where the first bin starts from\n",
"plt.hist(data, edges)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Another specification\n",
"binsize = 1\n",
"edges = np.arange(-5,8,binsize) # This specifies a sequence defining the left edge of each bin (and, indirectly, the number of bins) \n",
"plt.hist(data,edges)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next let's do a normalised histogram.\n",
"\n",
"We are using histograms to estimate probability densities. We need to remember that a probability density function has an area under the curve of 1 (when integrating over all values) so in order to be able to compare histogram and probability density function, we need to normalise our histograms.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"binsize = 1\n",
"edges = np.arange(-5,8,binsize) # varying the step size in the sequence means changing the number of bins.\n",
"plt.hist(data, edges, density=True) #normalise the y values to get a probability density function (pdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can superimpose the pdf (probability density function) of the normal distribution we used to generate the data. We will plot it in the same range as that covered by our edges."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"binsize = 1\n",
"edges = np.arange(-5,8,binsize) # varying the step size in the sequence means changing the number of bins. Consider what happens when changing the value of binsize\n",
"plt.hist(data, edges, density=True)\n",
"x = np.arange(-5,8,0.05) # the x values over which I want the PDF\n",
"y = stats.norm.pdf(x,mu,sigma)\n",
"plt.plot(x,y,'r-') # plot with red line\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is important to understand that if you draw data from a distribution, the smaller the number of samples, the less likely you are to be able to accurately estimate the distribution. A simple example is as follows. Say I have a biased coin (e.g., a coin that returns heads a bit more often than tails). I know its biased but you dont. If I only let you throw it 5 times, how likely are you to be able to infer that it is biased? What if I let you throw it 100 times? A million times? The more you can sample, the more your estimated probability density converges to the true probability density."
]
},
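{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see this convergence concretely, here is a quick illustrative simulation (a sketch; the true bias of 0.6 and the sample sizes are arbitrary choices): we flip a biased coin increasingly many times and watch the estimated probability of heads approach the true value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: estimating the heads probability of a biased coin.\n",
"# The true bias (0.6) and the sample sizes are arbitrary choices.\n",
"p_true = 0.6\n",
"for n in [5, 100, 10000, 1000000]:\n",
"    flips = np.random.random(n) < p_true # each flip is heads with probability p_true\n",
"    print(n, 'flips: estimated P(heads) =', flips.mean())"
]
},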
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play around with the parameters above to get a feel for how the histogram compares with the true probability density function for different sample sizes and different numbers of bins."
]
},
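{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, here is one way to put two sample sizes side by side, reusing the mu and sigma defined above (a sketch; the sample sizes 50 and 5000 are arbitrary choices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: normalised histogram vs true pdf for a small and a large sample.\n",
"# The sample sizes (50 and 5000) are arbitrary illustrative choices.\n",
"xs = np.arange(-5,8,0.05)\n",
"fig, axes = plt.subplots(1, 2, figsize=(10,4))\n",
"for ax, n in zip(axes, [50, 5000]):\n",
"    sample = np.random.normal(mu, sigma, n)\n",
"    ax.hist(sample, np.arange(-5,8,1), density=True)\n",
"    ax.plot(xs, stats.norm.pdf(xs, mu, sigma), 'r-')\n",
"    ax.set_title(str(n) + ' samples')\n",
"plt.show()"
]
},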
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional homework"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Try generating and plotting some data from another 1-d probability distribution, such as a uniform distribution and/or a log-normal distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
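{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a starting point, here is a minimal sketch for the uniform case (the range [0, 4] and the sample size are arbitrary choices); the log-normal case follows the same pattern using np.random.lognormal and stats.lognorm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: uniform data with its true pdf superimposed.\n",
"# The range [0, 4] and the sample size (1000) are arbitrary choices.\n",
"low, high = 0, 4\n",
"u = np.random.uniform(low, high, 1000)\n",
"plt.hist(u, 20, density=True)\n",
"xs = np.arange(low-1, high+1, 0.05)\n",
"plt.plot(xs, stats.uniform.pdf(xs, loc=low, scale=high-low), 'r-')\n",
"plt.show()"
]
},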
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Multivariate normal (Gaussian) distribution "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see how to generate data from a multivariate normal (Gaussian) distribution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nbpts=100 # number of data points to generate\n",
"\n",
"mu=[4,5] # mean of x and mean of y\n",
"sigma=[1,2] # standard deviation of x and standard deviation of y\n",
"corr=0.7 # correlation between x and y. This lies between -1 and 1.\n",
"cov=corr*sigma[0]*sigma[1] #covariance, which is correlation times the 2 std devs.\n",
"\n",
"C=[[sigma[0]**2,cov],[cov,sigma[1]**2]] # covariance matrix\n",
"\n",
"data=np.random.multivariate_normal(mu,C,nbpts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Produce a scatter plot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(data[:,0],data[:,1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play around with the parameters of the multivariate normal distribution and the number of data points to get a sense of how this distribution behaves in theory and in sample. "
]
},
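{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, the effect of the correlation parameter can be seen by drawing samples for a few different values and plotting them side by side (a sketch; the correlation values -0.9, 0 and 0.9 are arbitrary choices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: how the correlation parameter shapes the point cloud.\n",
"# The correlation values are arbitrary illustrative choices.\n",
"fig, axes = plt.subplots(1, 3, figsize=(12,4))\n",
"for ax, r in zip(axes, [-0.9, 0, 0.9]):\n",
"    cov_r = r*sigma[0]*sigma[1]\n",
"    C_r = [[sigma[0]**2,cov_r],[cov_r,sigma[1]**2]]\n",
"    sample = np.random.multivariate_normal(mu, C_r, nbpts)\n",
"    ax.scatter(sample[:,0], sample[:,1])\n",
"    ax.set_title('corr = ' + str(r))\n",
"plt.show()"
]
},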
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A note on matrices and vectors in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above I have programmed in matrices as lists of rows, and vectors simply as lists. For full functionality, they should be declared as numpy arrays. So for example\n",
"\n",
"$A=\\left( \\begin{array}{cc} 2 & 1 \\\\ 4 & 3 \\end{array} \\right)$\n",
"\n",
"$B=\\left( \\begin{array}{cc} 5 & 2 \\\\ 2 & 2 \\end{array} \\right)$\n",
"\n",
"$v=\\left( \\begin{array}{c} 2 \\\\ 1 \\end{array} \\right)$\n",
"\n",
"can be written as:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A=np.array([[2,1],[4,3]])\n",
"B=np.array([[5,2],[2,2]])\n",
"v=np.array([2,1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just so you know for future reference, matrix-vector multiplication and matrix-matrix multiplication are then done as follows"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.dot(A,v)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# alternatively\n",
"\n",
"np.matmul(A,v)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.dot(A,B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and the transpose can be obtained like this"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional homework"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play around with different ways of visualising 2-d Gaussian data such as that just plotted above."
]
},
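{
"cell_type": "markdown",
"metadata": {},
"source": [
"One option, as a starting point, is a 2-d histogram of the sample generated above (a sketch; the bin count of 20 is an arbitrary choice):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a 2-d histogram as one way to visualise the 2-d Gaussian sample.\n",
"# The bin count (20) is an arbitrary choice.\n",
"plt.hist2d(data[:,0], data[:,1], bins=20)\n",
"plt.colorbar(label='counts per bin')\n",
"plt.show()"
]
},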
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Kernel Density Estimation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, let's generate some random data from a normal (Gaussian) distribution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nbpts = 50 # Number of points to generate\n",
"mu = 1 # the mean\n",
"sigma = 2 # the standard deviation\n",
"\n",
"data = np.random.normal(mu,sigma,(nbpts,1)) # makes a column vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will use kernel density estimation to estimate the probability distribution, using a standard Gaussian function as the kernel.\n",
"\n",
"We will use the function KernelDensity from sklearn.neighbors (which has already been imported above). Here are the notes on this function:\n",
"\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"h = 0.5 # Bandwidth of kernel\n",
"kde = KernelDensity(bandwidth=h, kernel='gaussian') # Set kernel to Gaussian\n",
"kde.fit(data) # Estimate the model based on the data\n",
"\n",
"pts = np.arange(-5,8,0.05) # The x values over which I want the PDF and the density estimation\n",
"pts2=pts.reshape(-1,1) # Convert the x values into a 2-d array. The 1 here specifies there will be 1 column-\n",
" # this is because there is just 1 variable, x. \n",
" # The -1 here specifies that the number of columns will just be that which is \n",
" #implied by the quantity of data.\n",
"\n",
"log_dens = kde.score_samples(pts2) # Calculate the density estimates for desired data points (these return log densities)\n",
"plt.plot(pts2,np.exp(log_dens),'g-') # plot estimated densities in green\n",
"\n",
"# Overlay the actual data along the x-axis to see how the kernel density estimation is working\n",
"y=np.zeros(nbpts)\n",
"plt.scatter(data,y)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play around with the width of the kernel and the number of data points (and, if you like, the mean and standard deviation of the distribution)."
]
},
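{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, one way to compare several bandwidths at once is to overlay the resulting estimates on the true pdf (a sketch; the bandwidth values 0.1, 0.5 and 2.0 are arbitrary choices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: KDE curves for several bandwidths against the true pdf.\n",
"# The bandwidth values are arbitrary illustrative choices.\n",
"for h in [0.1, 0.5, 2.0]:\n",
"    kde_h = KernelDensity(bandwidth=h, kernel='gaussian').fit(data)\n",
"    plt.plot(pts, np.exp(kde_h.score_samples(pts2)), label='h = ' + str(h))\n",
"plt.plot(pts, stats.norm.pdf(pts, mu, sigma), 'r--', label='true pdf')\n",
"plt.legend()\n",
"plt.show()"
]
},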
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional homework"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write your own kernel density estimation function without using a library routine like sklearn.neighbors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
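{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a hint, recall that with a Gaussian kernel $K$ (the standard normal pdf) and bandwidth $h$, the kernel density estimate from data points $x_1,\\ldots,x_n$ is\n",
"\n",
"$\\hat{f}(x)=\\frac{1}{nh}\\sum_{i=1}^{n}K\\left(\\frac{x-x_i}{h}\\right)$\n",
"\n",
"Here is a minimal sketch of that formula (the function name my_kde and the bandwidth of 0.5 are arbitrary choices):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a hand-rolled Gaussian kernel density estimate.\n",
"# f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h), with K the standard normal pdf.\n",
"def my_kde(x_eval, sample, h):\n",
"    sample = sample.ravel() # flatten to a 1-d array of data points\n",
"    # broadcast to an (evaluation points) x (data points) grid, then average over the data\n",
"    return stats.norm.pdf((x_eval[:,None] - sample[None,:])/h).mean(axis=1)/h\n",
"\n",
"plt.plot(pts, my_kde(pts, data, 0.5), 'g-')\n",
"plt.show()"
]
},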
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further reading on kernel density estimation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An excellent resource on kernel density estimation is this tutorial:\n",
"\n",
"https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html\n",
"\n",
"See also here:\n",
"\n",
"https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. (Optional) Playing with Old Faithful data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the file **faithful.csv** is data on eruption lengths and time to next eruption (in minutes) from the Old Faithful geyser that we met in the lectures. This file can be found on Canvas, next to where you found this notebook. Place a copy of it in the same directory from which you are running this notebook. You can then open the data like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"faithful_data=pd.read_csv('faithful.csv')\n",
"print(faithful_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then you can convert this dataframe into a 2-d array, with eruption lengths in column 0 and waiting times in column 1 like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data=faithful_data.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Have a go at plotting histograms and scatter plots for these data, and use kernel density estimation to plot an estimate of the probability distribution of waiting times."
]
},
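{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a starting point, here is a sketch for the waiting times, assuming the column layout described above (eruption lengths in column 0, waiting times in column 1); the bin count and the bandwidth of 3 minutes are arbitrary choices:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: histogram plus kernel density estimate for the waiting times.\n",
"# The bin count (20) and bandwidth (3 minutes) are arbitrary choices.\n",
"waiting = data[:,1].reshape(-1,1)\n",
"plt.hist(waiting, 20, density=True)\n",
"kde_w = KernelDensity(bandwidth=3, kernel='gaussian').fit(waiting)\n",
"grid = np.arange(waiting.min()-10, waiting.max()+10, 0.5).reshape(-1,1)\n",
"plt.plot(grid, np.exp(kde_w.score_samples(grid)), 'g-')\n",
"plt.xlabel('waiting time (minutes)')\n",
"plt.show()"
]
},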
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}