{ "cells": [ { "cell_type": "markdown", "id": "b9178b48", "metadata": {}, "source": [ "# Distributions and Probabilities\n", "\n", "\n", "<a id='index-0'></a>" ] }, { "cell_type": "markdown", "id": "b65caf11", "metadata": {}, "source": [ "## Outline\n", "\n", "In this lecture we give a quick introduction to data and probability distributions using Python." ] }, { "cell_type": "code", "execution_count": null, "id": "f0a6373c", "metadata": { "hide-output": false }, "outputs": [], "source": [ "!pip install --upgrade yfinance " ] }, { "cell_type": "code", "execution_count": null, "id": "498a26e5", "metadata": { "hide-output": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "import yfinance as yf\n", "import scipy.stats\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "id": "90e93e5f", "metadata": {}, "source": [ "## Common distributions\n", "\n", "In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy." ] }, { "cell_type": "markdown", "id": "9fb5976c", "metadata": {}, "source": [ "### Discrete distributions\n", "\n", "Let’s start with discrete distributions.\n", "\n", "A discrete distribution is defined by a set of numbers $ S = \\{x_1, \\ldots, x_n\\} $ and a **probability mass function** (PMF) on $ S $, which is a function $ p $ from $ S $ to $ [0,1] $ with the property\n", "\n", "$$\n", "\\sum_{i=1}^n p(x_i) = 1\n", "$$\n", "\n", "We say that a random variable $ X $ **has distribution** $ p $ if $ X $ takes value $ x_i $ with probability $ p(x_i) $.\n", "\n", "That is,\n", "\n", "$$\n", "\\mathbb P\\{X = x_i\\} = p(x_i) \\quad \\text{for } i= 1, \\ldots, n\n", "$$\n", "\n", "The **mean** or **expected value** of a random variable $ X $ with distribution $ p $ is\n", "\n", "$$\n", "\\mathbb{E}[X] = \\sum_{i=1}^n x_i p(x_i)\n", "$$\n", "\n", "Expectation is also called the *first moment* of the distribution.\n", "\n", "We also refer to this number as the mean of the distribution (represented by) $ p $.\n", "\n", "The **variance** of $ X $ is defined as\n", "\n", "$$\n", "\\mathbb{V}[X] = \\sum_{i=1}^n (x_i - \\mathbb{E}[X])^2 p(x_i)\n", "$$\n", "\n", "Variance is also called the *second central moment* of the distribution.\n", "\n", "The **cumulative distribution function** (CDF) of $ X $ is defined by\n", "\n", "$$\n", "F(x) = \\mathbb{P}\\{X \\leq x\\}\n", " = \\sum_{i=1}^n \\mathbb 1\\{x_i \\leq x\\} p(x_i)\n", "$$\n", "\n", "Here $ \\mathbb 1\\{ \\textrm{statement} \\} = 1 $ if “statement” is true and zero otherwise.\n", "\n", "Hence the second term takes all $ x_i \\leq x $ and sums their probabilities." ] }, { "cell_type": "markdown", "id": "1316d990", "metadata": {}, "source": [ "#### Uniform distribution\n", "\n", "One simple example is the **uniform distribution**, where $ p(x_i) = 1/n $ for all $ i $.\n", "\n", "We can import the uniform distribution on $ S = \\{1, \\ldots, n\\} $ from SciPy like so:" ] }, { "cell_type": "code", "execution_count": null, "id": "9ab27562", "metadata": { "hide-output": false }, "outputs": [], "source": [ "n = 10\n", "u = scipy.stats.randint(1, n+1)" ] }, { "cell_type": "markdown", "id": "213aa8ac", "metadata": {}, "source": [ "Here’s the mean and variance:" ] }, { "cell_type": "code", "execution_count": null, "id": "50ae6f0d", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "markdown", "id": "fb42be55", "metadata": {}, "source": [ "The formula for the mean is $ (n+1)/2 $, and the formula for the variance is $ (n^2 - 1)/12 $.\n", "\n", "Now let’s evaluate the PMF:" ] }, { "cell_type": "code", "execution_count": null, "id": "219bde86", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.pmf(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "3ab7a7a8", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.pmf(2)" ] }, { "cell_type": "markdown", "id": "e414e648", "metadata": {}, "source": [ "Here’s a plot of the probability mass function:" ] }, { "cell_type": "code", "execution_count": null, "id": "484ab925", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "S = np.arange(1, n+1)\n", "ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)\n", "ax.vlines(S, 0, u.pmf(S), lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('PMF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "3a16f419", "metadata": {}, "source": [ "Here’s a plot of the CDF:" ] }, { "cell_type": "code", "execution_count": null, "id": "2d6d87d2", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "S = np.arange(1, n+1)\n", "ax.step(S, u.cdf(S))\n", "ax.vlines(S, 0, u.cdf(S), lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('CDF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "c0af0076", "metadata": {}, "source": [ "The CDF jumps up by $ p(x_i) $ at $ x_i $." ] }, { "cell_type": "markdown", "id": "6577a481", "metadata": {}, "source": [ "#### Exercise 19.1\n", "\n", "Calculate the mean and variance for this parameterization (i.e., $ n=10 $)\n", "directly from the PMF, using the expressions given above.\n", "\n", "Check that your answers agree with `u.mean()` and `u.var()`." ] }, { "cell_type": "markdown", "id": "2f398235", "metadata": {}, "source": [ "#### Bernoulli distribution\n", "\n", "Another useful distribution is the Bernoulli distribution on $ S = \\{0,1\\} $, which has PMF:\n", "\n", "$$\n", "p(i) = \\theta^i (1 - \\theta)^{1-i}\n", "\\qquad (i = 0, 1)\n", "$$\n", "\n", "Here $ \\theta \\in [0,1] $ is a parameter.\n", "\n", "We can think of this distribution as modeling probabilities for a random trial with success probability $ \\theta $.\n", "\n", "- $ p(1) = \\theta $ means that the trial succeeds (takes value 1) with probability $ \\theta $ \n", "- $ p(0) = 1 - \\theta $ means that the trial fails (takes value 0) with\n", " probability $ 1-\\theta $ \n", "\n", "\n", "The formula for the mean is $ \\theta $, and the formula for the variance is $ \\theta(1-\\theta) $.\n", "\n", "We can import the Bernoulli distribution on $ S = \\{0,1\\} $ from SciPy like so:" ] }, { "cell_type": "code", "execution_count": null, "id": "3e668755", "metadata": { "hide-output": false }, "outputs": [], "source": [ "θ = 0.4\n", "u = scipy.stats.bernoulli(θ)" ] }, { "cell_type": "markdown", "id": "98b3f700", "metadata": {}, "source": [ "Here’s the mean and variance at $ \\theta=0.4 $" ] }, { "cell_type": "code", "execution_count": null, "id": "29de9c82", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "markdown", "id": "fac39581", "metadata": {}, "source": [ "We can evaluate the PMF as follows" ] }, { "cell_type": "code", "execution_count": null, "id": "b0247db8", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.pmf(0), u.pmf(1)" ] }, { "cell_type": "markdown", "id": "ec961248", "metadata": {}, "source": [ "#### Binomial distribution\n", "\n", "Another useful (and more interesting) distribution is the **binomial distribution** on $ S=\\{0, \\ldots, n\\} $, which has PMF:\n", "\n", "$$\n", "p(i) = \\binom{n}{i} \\theta^i (1-\\theta)^{n-i}\n", "$$\n", "\n", "Again, $ \\theta \\in [0,1] $ is a parameter.\n", "\n", "The interpretation of $ p(i) $ is: the probability of $ i $ successes in $ n $ independent trials with success probability $ \\theta $.\n", "\n", "For example, if $ \\theta=0.5 $, then $ p(i) $ is the probability of $ i $ heads in $ n $ flips of a fair coin.\n", "\n", "The formula for the mean is $ n \\theta $ and the formula for the variance is $ n \\theta (1-\\theta) $.\n", "\n", "Let’s investigate an example" ] }, { "cell_type": "code", "execution_count": null, "id": "dc984f47", "metadata": { "hide-output": false }, "outputs": [], "source": [ "n = 10\n", "θ = 0.5\n", "u = scipy.stats.binom(n, θ)" ] }, { "cell_type": "markdown", "id": "ce1b7547", "metadata": {}, "source": [ "According to our formulas, the mean and variance are" ] }, { "cell_type": "code", "execution_count": null, "id": "e0d4fa4f", "metadata": { "hide-output": false }, "outputs": [], "source": [ "n * θ, n * θ * (1 - θ) " ] }, { "cell_type": "markdown", "id": "476e5e77", "metadata": {}, "source": [ "Let’s see if SciPy gives us the same results:" ] }, { "cell_type": "code", "execution_count": null, "id": "888f3f27", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "markdown", "id": "661c2dd4", "metadata": {}, "source": [ "Here’s the PMF:" ] }, { "cell_type": "code", "execution_count": null, "id": "c533ed7d", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.pmf(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "ad56058e", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "S = np.arange(1, n+1)\n", "ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)\n", "ax.vlines(S, 0, u.pmf(S), lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('PMF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "445928eb", "metadata": {}, "source": [ "Here’s the CDF:" ] }, { "cell_type": "code", "execution_count": null, "id": "0379109d", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "S = np.arange(1, n+1)\n", "ax.step(S, u.cdf(S))\n", "ax.vlines(S, 0, u.cdf(S), lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('CDF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "3ae50ba7", "metadata": {}, "source": [ "#### Exercise 19.2\n", "\n", "Using `u.pmf`, check that our definition of the CDF given above calculates the same function as `u.cdf`." ] }, { "cell_type": "markdown", "id": "261dca2c", "metadata": {}, "source": [ "#### Solution to[ Exercise 19.2](https://intro.quantecon.org/#prob_ex3)\n", "\n", "Here is one solution:" ] }, { "cell_type": "code", "execution_count": null, "id": "e90de30b", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "S = np.arange(1, n+1)\n", "u_sum = np.cumsum(u.pmf(S))\n", "ax.step(S, u_sum)\n", "ax.vlines(S, 0, u_sum, lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('CDF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "12893c37", "metadata": {}, "source": [ "We can see that the output graph is the same as the one above." ] }, { "cell_type": "markdown", "id": "bffae2d4", "metadata": {}, "source": [ "#### Geometric distribution\n", "\n", "The geometric distribution has infinite support $ S = \\{0, 1, 2, \\ldots\\} $ and its PMF is given by\n", "\n", "$$\n", "p(i) = (1 - \\theta)^i \\theta\n", "$$\n", "\n", "where $ \\theta \\in [0,1] $ is a parameter\n", "\n", "(A discrete distribution has infinite support if the set of points to which it assigns positive probability is infinite.)\n", "\n", "To understand the distribution, think of repeated independent random trials, each with success probability $ \\theta $.\n", "\n", "The interpretation of $ p(i) $ is: the probability there are $ i $ failures before the first success occurs.\n", "\n", "It can be shown that the mean of the distribution is $ 1/\\theta $ and the variance is $ (1-\\theta)/\\theta $.\n", "\n", "Here’s an example." ] }, { "cell_type": "code", "execution_count": null, "id": "3a1074a6", "metadata": { "hide-output": false }, "outputs": [], "source": [ "θ = 0.1\n", "u = scipy.stats.geom(θ)\n", "u.mean(), u.var()" ] }, { "cell_type": "markdown", "id": "5c33ad04", "metadata": {}, "source": [ "Here’s part of the PMF:" ] }, { "cell_type": "code", "execution_count": null, "id": "2f3024a4", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "n = 20\n", "S = np.arange(n)\n", "ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)\n", "ax.vlines(S, 0, u.pmf(S), lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('PMF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "0b4ee58d", "metadata": {}, "source": [ "#### Poisson distribution\n", "\n", "The Poisson distribution on $ S = \\{0, 1, \\ldots\\} $ with parameter $ \\lambda > 0 $ has PMF\n", "\n", "$$\n", "p(i) = \\frac{\\lambda^i}{i!} e^{-\\lambda}\n", "$$\n", "\n", "The interpretation of $ p(i) $ is: the probability of $ i $ events in a fixed time interval, where the events occur independently at a constant rate $ \\lambda $.\n", "\n", "It can be shown that the mean is $ \\lambda $ and the variance is also $ \\lambda $.\n", "\n", "Here’s an example." ] }, { "cell_type": "code", "execution_count": null, "id": "f5130143", "metadata": { "hide-output": false }, "outputs": [], "source": [ "λ = 2\n", "u = scipy.stats.poisson(λ)\n", "u.mean(), u.var()" ] }, { "cell_type": "markdown", "id": "2e7b40ad", "metadata": {}, "source": [ "Here’s the PMF:" ] }, { "cell_type": "code", "execution_count": null, "id": "ca11e906", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.pmf(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "af549aa3", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "S = np.arange(1, n+1)\n", "ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)\n", "ax.vlines(S, 0, u.pmf(S), lw=0.2)\n", "ax.set_xticks(S)\n", "ax.set_xlabel('S')\n", "ax.set_ylabel('PMF')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "24c4651d", "metadata": {}, "source": [ "### Continuous distributions\n", "\n", "A continuous distribution is represented by a **probability density function**, which is a function $ p $ over $ \\mathbb R $ (the set of all real numbers) such that $ p(x) \\geq 0 $ for all $ x $ and\n", "\n", "$$\n", "\\int_{-\\infty}^\\infty p(x) dx = 1\n", "$$\n", "\n", "We say that random variable $ X $ has distribution $ p $ if\n", "\n", "$$\n", "\\mathbb P\\{a < X < b\\} = \\int_a^b p(x) dx\n", "$$\n", "\n", "for all $ a \\leq b $.\n", "\n", "The definition of the mean and variance of a random variable $ X $ with distribution $ p $ are the same as the discrete case, after replacing the sum with an integral.\n", "\n", "For example, the mean of $ X $ is\n", "\n", "$$\n", "\\mathbb{E}[X] = \\int_{-\\infty}^\\infty x p(x) dx\n", "$$\n", "\n", "The **cumulative distribution function** (CDF) of $ X $ is defined by\n", "\n", "$$\n", "F(x) = \\mathbb P\\{X \\leq x\\}\n", " = \\int_{-\\infty}^x p(x) dx\n", "$$" ] }, { "cell_type": "markdown", "id": "7d1e1dbe", "metadata": {}, "source": [ "#### Normal distribution\n", "\n", "Perhaps the most famous distribution is the **normal distribution**, which has density\n", "\n", "$$\n", "p(x) = \\frac{1}{\\sqrt{2\\pi}\\sigma}\n", " \\exp\\left(-\\frac{(x-\\mu)^2}{2\\sigma^2}\\right)\n", "$$\n", "\n", "This distribution has two parameters, $ \\mu \\in \\mathbb R $ and $ \\sigma \\in (0, \\infty) $.\n", "\n", "Using calculus, it can be shown that, for this distribution, the mean is $ \\mu $ and the variance is $ \\sigma^2 $.\n", "\n", "We can obtain the moments, PDF and CDF of the normal density via SciPy as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "3acd2e3d", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ, σ = 0.0, 1.0\n", "u = scipy.stats.norm(μ, σ)" ] }, { "cell_type": "code", "execution_count": null, "id": "a7f6ea1f", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "markdown", "id": "90abcc9f", "metadata": {}, "source": [ "Here’s a plot of the density — the famous “bell-shaped curve”:" ] }, { "cell_type": "code", "execution_count": null, "id": "86480627", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ_vals = [-1, 0, 1]\n", "σ_vals = [0.4, 1, 1.6]\n", "fig, ax = plt.subplots()\n", "x_grid = np.linspace(-4, 4, 200)\n", "\n", "for μ, σ in zip(μ_vals, σ_vals):\n", " u = scipy.stats.norm(μ, σ)\n", " ax.plot(x_grid, u.pdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\mu={μ}, \\sigma={σ}$')\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('PDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f43a1345", "metadata": {}, "source": [ "Here’s a plot of the CDF:" ] }, { "cell_type": "code", "execution_count": null, "id": "0b0663a8", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for μ, σ in zip(μ_vals, σ_vals):\n", " u = scipy.stats.norm(μ, σ)\n", " ax.plot(x_grid, u.cdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\mu={μ}, \\sigma={σ}$')\n", " ax.set_ylim(0, 1)\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('CDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "af905a9e", "metadata": {}, "source": [ "#### Lognormal distribution\n", "\n", "The **lognormal distribution** is a distribution on $ \\left(0, \\infty\\right) $ with density\n", "\n", "$$\n", "p(x) = \\frac{1}{\\sigma x \\sqrt{2\\pi}}\n", " \\exp \\left(- \\frac{\\left(\\log x - \\mu\\right)^2}{2 \\sigma^2} \\right)\n", "$$\n", "\n", "This distribution has two parameters, $ \\mu $ and $ \\sigma $.\n", "\n", "It can be shown that, for this distribution, the mean is $ \\exp\\left(\\mu + \\sigma^2/2\\right) $ and the variance is $ \\left[\\exp\\left(\\sigma^2\\right) - 1\\right] \\exp\\left(2\\mu + \\sigma^2\\right) $.\n", "\n", "It can be proved that\n", "\n", "- if $ X $ is lognormally distributed, then $ \\log X $ is normally distributed, and \n", "- if $ X $ is normally distributed, then $ \\exp X $ is lognormally distributed. \n", "\n", "\n", "We can obtain the moments, PDF, and CDF of the lognormal density as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "2ea5b6f0", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ, σ = 0.0, 1.0\n", "u = scipy.stats.lognorm(s=σ, scale=np.exp(μ))" ] }, { "cell_type": "code", "execution_count": null, "id": "683b3ace", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "code", "execution_count": null, "id": "9e2e9c85", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ_vals = [-1, 0, 1]\n", "σ_vals = [0.25, 0.5, 1]\n", "x_grid = np.linspace(0, 3, 200)\n", "\n", "fig, ax = plt.subplots()\n", "for μ, σ in zip(μ_vals, σ_vals):\n", " u = scipy.stats.lognorm(σ, scale=np.exp(μ))\n", " ax.plot(x_grid, u.pdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=fr'$\\mu={μ}, \\sigma={σ}$')\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('PDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "af0ce145", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "μ = 1\n", "for σ in σ_vals:\n", " u = scipy.stats.norm(μ, σ)\n", " ax.plot(x_grid, u.cdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\mu={μ}, \\sigma={σ}$')\n", " ax.set_ylim(0, 1)\n", " ax.set_xlim(0, 3)\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('CDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "45537320", "metadata": {}, "source": [ "#### Exponential distribution\n", "\n", "The **exponential distribution** is a distribution supported on $ \\left(0, \\infty\\right) $ with density\n", "\n", "$$\n", "p(x) = \\lambda \\exp \\left( - \\lambda x \\right)\n", "\\qquad (x > 0)\n", "$$\n", "\n", "This distribution has one parameter $ \\lambda $.\n", "\n", "The exponential distribution can be thought of as the continuous analog of the geometric distribution.\n", "\n", "It can be shown that, for this distribution, the mean is $ 1/\\lambda $ and the variance is $ 1/\\lambda^2 $.\n", "\n", "We can obtain the moments, PDF, and CDF of the exponential density as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "a1812b66", "metadata": { "hide-output": false }, "outputs": [], "source": [ "λ = 1.0\n", "u = scipy.stats.expon(scale=1/λ)" ] }, { "cell_type": "code", "execution_count": null, "id": "1ac1e7ec", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "code", "execution_count": null, "id": "99694764", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "λ_vals = [0.5, 1, 2]\n", "x_grid = np.linspace(0, 6, 200)\n", "\n", "for λ in λ_vals:\n", " u = scipy.stats.expon(scale=1/λ)\n", " ax.plot(x_grid, u.pdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\lambda={λ}$')\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('PDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "2c6e5abc", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for λ in λ_vals:\n", " u = scipy.stats.expon(scale=1/λ)\n", " ax.plot(x_grid, u.cdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\lambda={λ}$')\n", " ax.set_ylim(0, 1)\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('CDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ca08bae1", "metadata": {}, "source": [ "#### Beta distribution\n", "\n", "The **beta distribution** is a distribution on $ (0, 1) $ with density\n", "\n", "$$\n", "p(x) = \\frac{\\Gamma(\\alpha + \\beta)}{\\Gamma(\\alpha) \\Gamma(\\beta)}\n", " x^{\\alpha - 1} (1 - x)^{\\beta - 1}\n", "$$\n", "\n", "where $ \\Gamma $ is the [gamma function](https://en.wikipedia.org/wiki/Gamma_function).\n", "\n", "(The role of the gamma function is just to normalize the density, so that it\n", "integrates to one.)\n", "\n", "This distribution has two parameters, $ \\alpha > 0 $ and $ \\beta > 0 $.\n", "\n", "It can be shown that, for this distribution, the mean is $ \\alpha / (\\alpha + \\beta) $ and\n", "the variance is $ \\alpha \\beta / (\\alpha + \\beta)^2 (\\alpha + \\beta + 1) $.\n", "\n", "We can obtain the moments, PDF, and CDF of the Beta density as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "55fc1362", "metadata": { "hide-output": false }, "outputs": [], "source": [ "α, β = 3.0, 1.0\n", "u = scipy.stats.beta(α, β)" ] }, { "cell_type": "code", "execution_count": null, "id": "6384c7c3", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "code", "execution_count": null, "id": "1b0467a1", "metadata": { "hide-output": false }, "outputs": [], "source": [ "α_vals = [0.5, 1, 5, 25, 3]\n", "β_vals = [3, 1, 10, 20, 0.5]\n", "x_grid = np.linspace(0, 1, 200)\n", "\n", "fig, ax = plt.subplots()\n", "for α, β in zip(α_vals, β_vals):\n", " u = scipy.stats.beta(α, β)\n", " ax.plot(x_grid, u.pdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\alpha={α}, \\beta={β}$')\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('PDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "58dbcefa", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for α, β in zip(α_vals, β_vals):\n", " u = scipy.stats.beta(α, β)\n", " ax.plot(x_grid, u.cdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\alpha={α}, \\beta={β}$')\n", " ax.set_ylim(0, 1)\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('CDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e72f1465", "metadata": {}, "source": [ "#### Gamma distribution\n", "\n", "The **gamma distribution** is a distribution on $ \\left(0, \\infty\\right) $ with density\n", "\n", "$$\n", "p(x) = \\frac{\\beta^\\alpha}{\\Gamma(\\alpha)}\n", " x^{\\alpha - 1} \\exp(-\\beta x)\n", "$$\n", "\n", "This distribution has two parameters, $ \\alpha > 0 $ and $ \\beta > 0 $.\n", "\n", "It can be shown that, for this distribution, the mean is $ \\alpha / \\beta $ and\n", "the variance is $ \\alpha / \\beta^2 $.\n", "\n", "One interpretation is that if $ X $ is gamma distributed and $ \\alpha $ is an\n", "integer, then $ X $ is the sum of $ \\alpha $ independent exponentially distributed\n", "random variables with mean $ 1/\\beta $.\n", "\n", "We can obtain the moments, PDF, and CDF of the Gamma density as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "782eafd0", "metadata": { "hide-output": false }, "outputs": [], "source": [ "α, β = 3.0, 2.0\n", "u = scipy.stats.gamma(α, scale=1/β)" ] }, { "cell_type": "code", "execution_count": null, "id": "3225dcae", "metadata": { "hide-output": false }, "outputs": [], "source": [ "u.mean(), u.var()" ] }, { "cell_type": "code", "execution_count": null, "id": "a2b14cfd", "metadata": { "hide-output": false }, "outputs": [], "source": [ "α_vals = [1, 3, 5, 10]\n", "β_vals = [3, 5, 3, 3]\n", "x_grid = np.linspace(0, 7, 200)\n", "\n", "fig, ax = plt.subplots()\n", "for α, β in zip(α_vals, β_vals):\n", " u = scipy.stats.gamma(α, scale=1/β)\n", " ax.plot(x_grid, u.pdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\alpha={α}, \\beta={β}$')\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('PDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "abe86083", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for α, β in zip(α_vals, β_vals):\n", " u = scipy.stats.gamma(α, scale=1/β)\n", " ax.plot(x_grid, u.cdf(x_grid),\n", " alpha=0.5, lw=2,\n", " label=rf'$\\alpha={α}, \\beta={β}$')\n", " ax.set_ylim(0, 1)\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('CDF')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "0ab27b70", "metadata": {}, "source": [ "## Observed distributions\n", "\n", "Sometimes we refer to observed data or measurements as “distributions”.\n", "\n", "For example, let’s say we observe the income of 10 people over a year:" ] }, { "cell_type": "code", "execution_count": null, "id": "05e3305a", "metadata": { "hide-output": false }, "outputs": [], "source": [ "data = [['Hiroshi', 1200], \n", " ['Ako', 1210], \n", " ['Emi', 1400],\n", " ['Daiki', 990],\n", " ['Chiyo', 1530],\n", " ['Taka', 1210],\n", " ['Katsuhiko', 1240],\n", " ['Daisuke', 1124],\n", " ['Yoshi', 1330],\n", " ['Rie', 1340]]\n", "\n", "df = pd.DataFrame(data, columns=['name', 'income'])\n", "df" ] }, { "cell_type": "markdown", "id": "77e58a30", "metadata": {}, "source": [ "In this situation, we might refer to the set of their incomes as the “income distribution.”\n", "\n", "The terminology is confusing because this set is not a probability distribution\n", "— it’s just a collection of numbers.\n", "\n", "However, as we will see, there are connections between observed distributions (i.e., sets of\n", "numbers like the income distribution above) and probability distributions.\n", "\n", "Below we explore some observed distributions." ] }, { "cell_type": "markdown", "id": "d5c6a2a2", "metadata": {}, "source": [ "### Summary statistics\n", "\n", "Suppose we have an observed distribution with values $ \\{x_1, \\ldots, x_n\\} $\n", "\n", "The **sample mean** of this distribution is defined as\n", "\n", "$$\n", "\\bar x = \\frac{1}{n} \\sum_{i=1}^n x_i\n", "$$\n", "\n", "The **sample variance** is defined as\n", "\n", "$$\n", "\\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar x)^2\n", "$$\n", "\n", "For the income distribution given above, we can calculate these numbers via" ] }, { "cell_type": "code", "execution_count": null, "id": "4af4e1b1", "metadata": { "hide-output": false }, "outputs": [], "source": [ "x = df['income']\n", "x.mean(), x.var()" ] }, { "cell_type": "markdown", "id": "491da253", "metadata": {}, "source": [ "### Exercise 19.3\n", "\n", "If you try to check that the formulas given above for the sample mean and sample\n", "variance produce the same numbers, you will see that the variance isn’t quite\n", "right. This is because SciPy uses $ 1/(n-1) $ instead of $ 1/n $ as the term at the\n", "front of the variance. (Some books define the sample variance this way.)\n", "Confirm." ] }, { "cell_type": "markdown", "id": "a4ad7d1a", "metadata": {}, "source": [ "### Visualization\n", "\n", "Let’s look at different ways that we can visualize one or more observed distributions.\n", "\n", "We will cover\n", "\n", "- histograms \n", "- kernel density estimates and \n", "- violin plots " ] }, { "cell_type": "markdown", "id": "4046e4cd", "metadata": {}, "source": [ "#### Histograms\n", "\n", "We can histogram the income distribution we just constructed as follows" ] }, { "cell_type": "code", "execution_count": null, "id": "08194b2e", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.hist(x, bins=5, density=True, histtype='bar')\n", "ax.set_xlabel('income')\n", "ax.set_ylabel('density')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "1594ca23", "metadata": {}, "source": [ "Let’s look at a distribution from real data.\n", "\n", "In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2024/1/1.\n", "\n", "The monthly return is calculated as the percent change in the share price over each month.\n", "\n", "So we will have one observation for each month." ] }, { "cell_type": "code", "execution_count": null, "id": "012e9fca", "metadata": { "hide-output": false }, "outputs": [], "source": [ "df = yf.download('AMZN', '2000-1-1', '2024-1-1', interval='1mo')\n", "prices = df['Close']\n", "x_amazon = prices.pct_change()[1:] * 100\n", "x_amazon.head()" ] }, { "cell_type": "markdown", "id": "39f757db", "metadata": {}, "source": [ "The first observation is the monthly return (percent change) over January 2000, which was" ] }, { "cell_type": "code", "execution_count": null, "id": "0b027ed0", "metadata": { "hide-output": false }, "outputs": [], "source": [ "x_amazon.iloc[0]" ] }, { "cell_type": "markdown", "id": "21ca0cb7", "metadata": {}, "source": [ "Let’s turn the return observations into an array and histogram it." ] }, { "cell_type": "code", "execution_count": null, "id": "7eacaeba", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.hist(x_amazon, bins=20)\n", "ax.set_xlabel('monthly return (percent change)')\n", "ax.set_ylabel('density')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "4c81a0b5", "metadata": {}, "source": [ "#### Kernel density estimates\n", "\n", "Kernel density estimates (KDE) provide a simple way to estimate and visualize the density of a distribution.\n", "\n", "If you are not familiar with KDEs, you can think of them as a smoothed\n", "histogram.\n", "\n", "Let’s have a look at a KDE formed from the Amazon return data." ] }, { "cell_type": "code", "execution_count": null, "id": "af6de934", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "sns.kdeplot(x_amazon, ax=ax)\n", "ax.set_xlabel('monthly return (percent change)')\n", "ax.set_ylabel('KDE')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e1ded540", "metadata": {}, "source": [ "The smoothness of the KDE is dependent on how we choose the bandwidth." ] }, { "cell_type": "code", "execution_count": null, "id": "979cd923", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "sns.kdeplot(x_amazon, ax=ax, bw_adjust=0.1, alpha=0.5, label=\"bw=0.1\")\n", "sns.kdeplot(x_amazon, ax=ax, bw_adjust=0.5, alpha=0.5, label=\"bw=0.5\")\n", "sns.kdeplot(x_amazon, ax=ax, bw_adjust=1, alpha=0.5, label=\"bw=1\")\n", "ax.set_xlabel('monthly return (percent change)')\n", "ax.set_ylabel('KDE')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "03c76ad9", "metadata": {}, "source": [ "When we use a larger bandwidth, the KDE is smoother.\n", "\n", "A suitable bandwidth is not too smooth (underfitting) or too wiggly (overfitting)." ] }, { "cell_type": "markdown", "id": "d59c9110", "metadata": {}, "source": [ "#### Violin plots\n", "\n", "Another way to display an observed distribution is via a violin plot." ] }, { "cell_type": "code", "execution_count": null, "id": "1746f2d4", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.violinplot(x_amazon)\n", "ax.set_ylabel('monthly return (percent change)')\n", "ax.set_xlabel('KDE')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "516ef688", "metadata": {}, "source": [ "Violin plots are particularly useful when we want to compare different distributions.\n", "\n", "For example, let’s compare the monthly returns on Amazon shares with the monthly return on Costco shares." ] }, { "cell_type": "code", "execution_count": null, "id": "b53071fc", "metadata": { "hide-output": false }, "outputs": [], "source": [ "df = yf.download('COST', '2000-1-1', '2024-1-1', interval='1mo')\n", "prices = df['Close']\n", "x_costco = prices.pct_change()[1:] * 100" ] }, { "cell_type": "code", "execution_count": null, "id": "60cb779c", "metadata": { "hide-output": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.violinplot([x_amazon['AMZN'], x_costco['COST']])\n", "ax.set_ylabel('monthly return (percent change)')\n", "ax.set_xlabel('retailers')\n", "\n", "ax.set_xticks([1, 2])\n", "ax.set_xticklabels(['Amazon', 'Costco'])\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ca05fb85", "metadata": {}, "source": [ "### Connection to probability distributions\n", "\n", "Let’s discuss the connection between observed distributions and probability distributions.\n", "\n", "Sometimes it’s helpful to imagine that an observed distribution is generated by a particular probability distribution.\n", "\n", "For example, we might look at the returns from Amazon above and imagine that they were generated by a normal distribution.\n", "\n", "(Even though this is not true, it *might* be a helpful way to think about the data.)\n", "\n", "Here we match a normal distribution to the Amazon monthly returns by setting the\n", "sample mean to the mean of the normal distribution and the sample variance equal\n", "to the variance.\n", "\n", "Then we plot the density and the histogram." ] }, { "cell_type": "code", "execution_count": null, "id": "0afae61b", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ = x_amazon.mean()\n", "σ_squared = x_amazon.var()\n", "σ = np.sqrt(σ_squared)\n", "u = scipy.stats.norm(μ, σ)" ] }, { "cell_type": "code", "execution_count": null, "id": "e6e66f30", "metadata": { "hide-output": false }, "outputs": [], "source": [ "x_grid = np.linspace(-50, 65, 200)\n", "fig, ax = plt.subplots()\n", "ax.plot(x_grid, u.pdf(x_grid))\n", "ax.hist(x_amazon, density=True, bins=40)\n", "ax.set_xlabel('monthly return (percent change)')\n", "ax.set_ylabel('density')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e524d17c", "metadata": {}, "source": [ "The match between the histogram and the density is not bad but also not very good.\n", "\n", "One reason is that the normal distribution is not really a good fit for this observed data — we will discuss this point again when we talk about [heavy tailed distributions](https://intro.quantecon.org/heavy_tails.html#heavy-tail).\n", "\n", "Of course, if the data really *is* generated by the normal distribution, then the fit will be better.\n", "\n", "Let’s see this in action\n", "\n", "- first we generate random draws from the normal distribution \n", "- then we histogram them and compare with the density. " ] }, { "cell_type": "code", "execution_count": null, "id": "c0b5bb39", "metadata": { "hide-output": false }, "outputs": [], "source": [ "μ, σ = 0, 1\n", "u = scipy.stats.norm(μ, σ)\n", "N = 2000 # Number of observations\n", "x_draws = u.rvs(N)\n", "x_grid = np.linspace(-4, 4, 200)\n", "fig, ax = plt.subplots()\n", "ax.plot(x_grid, u.pdf(x_grid))\n", "ax.hist(x_draws, density=True, bins=40)\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('density')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "b625c136", "metadata": {}, "source": [ "Note that if you keep increasing $ N $, which is the number of observations, the fit will get better and better.\n", "\n", "This convergence is a version of the “law of large numbers”, which we will discuss [later](https://intro.quantecon.org/lln_clt.html#lln-mr)." ] } ], "metadata": { "date": 1750037601.600857, "filename": "prob_dist.md", "kernelspec": { "display_name": "Python", "language": "python3", "name": "python3" }, "title": "Distributions and Probabilities" }, "nbformat": 4, "nbformat_minor": 5 }