{
 "metadata": {
  "name": "01_24_doing_computations"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "# For the sqrt function\n",
      "from math import sqrt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this class we shall use what we have learnt so far to write some programs. This begins the math\n",
      "part of your course. To start with, let us do some statistical computations. We shall input the data as lists\n",
      "of numbers. We shall see later how to read such data from files."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "The data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "data_mid = [23, 45, 83, 90, 12, 87, 67, 69, 74, 36, 43, 69, 66, 70]\n",
      "data_end = [45, 44, 95, 87, 24, 100, 45, 70, 66, 32, 50, 55, 80, 81]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Mean"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First let us find the means to see whether the class performance improved after the mid-sem. Let us write it\n",
      "as a function. Recall that the mean of a collection of numbers $x_i$, $i = 1, \\ldots, n$, is given by the formula\n",
      "\n",
      "\\begin{equation}\n",
      "\\text{mean} = \\frac{\\sum_{i = 1}^n x_i}{n}.\n",
      "\\end{equation}"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def find_mean(lst) :\n",
      "    \"\"\"Given a list as input, this function computes the mean.\"\"\"\n",
      "    # Store the sum in a variable\n",
      "    sum = 0.0                               # Quiz : Why 0.0 and not just 0?\n",
      "    # loop over all the numbers in the list\n",
      "    for no in lst :\n",
      "        sum += no\n",
      "        \n",
      "    # The number of elements in the list\n",
      "    no_of_entries = len(lst)\n",
      "    \n",
      "    # mean by definition is\n",
      "    mean = sum/no_of_entries\n",
      "    \n",
      "    # now, we return the value\n",
      "    return mean"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us try this out"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"\"\"The mean for mid-sem is %6.2f,\n",
      "while that for the end-sem is %6.2f.\"\"\" % (find_mean(data_mid), find_mean(data_end))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The mean for mid-sem is  59.57,\n",
        "while that for the end-sem is  62.43.\n"
       ]
      }
     ],
     "prompt_number": 4
    },
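    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a cross-check, here is a minimal sketch of the same computation using Python's built-in `sum` and `len`.\n",
      "The name `mean_builtin` and the small list `sample_scores` are made up for this illustration and are not part of\n",
      "the class code. The division goes through `float` so that the result does not depend on whether the entries\n",
      "are integers."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def mean_builtin(lst) :\n",
      "    \"\"\"Mean of a non-empty list, using the built-ins sum and len.\"\"\"\n",
      "    # float(...) keeps the division from truncating when all entries are integers.\n",
      "    return sum(lst) / float(len(lst))\n",
      "\n",
      "# Hypothetical sample, chosen so that the mean is easy to verify by hand.\n",
      "sample_scores = [2, 4, 9]\n",
      "sample_mean = mean_builtin(sample_scores)    # (2 + 4 + 9) / 3 = 5.0\n",
      "print(\"Mean of the sample is %6.2f.\" % sample_mean)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },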
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Standard deviation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Looks like there is a general increase. But an increase in the mean need not mean that everyone improved. It might be that a\n",
      "substantial number of people actually did slightly worse, while a few people did exceptionally well in the end\n",
      "sem. One measure of the spread is called the standard deviation. The formula, for $x_i$ as above, is\n",
      "\n",
      "\\begin{equation}\n",
      "\\text{standard deviation} = \\sqrt{\\frac{1}{n} \\sum_{i=1}^n (x_i - \\mu)^2} \n",
      "\\end{equation}\n",
      "\n",
      "where $\\mu$ is the mean."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us try to code this."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def find_sd(lst) :\n",
      "    \"\"\"Given a list, this function computes the (biased) standard deviation.\"\"\"\n",
      "    # This one depends on the function find_mean (). Let us find the mean.\n",
      "    mu = find_mean(lst)\n",
      "    \n",
      "    # Rest of the code is similar to mean. Introduce a variable to store the sum of squares.\n",
      "    sum_sq_dev = 0.0 \n",
      "    \n",
      "    # Loop over the data to find this sum of squared deviations from the mean.\n",
      "    for no in lst :\n",
      "        sum_sq_dev += (no - mu) ** 2\n",
      "        \n",
      "    # To compute s.d. we also need to know the number of data points:\n",
      "    n = len(lst)\n",
      "        \n",
      "    # Now to finish computing sd, we just need to divide by n and take square root.\n",
      "    sd = sqrt(sum_sq_dev / n)\n",
      "    \n",
      "    # Don't ever forget to return your hard work.\n",
      "    return sd\n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us try it out."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"\"\"The standard deviations for the two exams are\n",
      "    %6.2f                 ,               %6.2f\n",
      "respectively.\"\"\" % (find_sd(data_mid), find_sd(data_end))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The standard deviations for the two exams are\n",
        "     23.09                 ,                22.93\n",
        "respectively.\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This code is very intuitive. However, we run through the data twice: once to compute the mean and once\n",
      "for the standard deviation. To save a bit of work, we can simplify the formula. Note that \n",
      "$n \\mu = \\sum_i x_i$. Therefore, \n",
      "\\begin{equation}\n",
      "\\begin{aligned}\n",
      "\\sum_i (x_i - \\mu)^2 &= \\sum_i (x_i^2 - 2\\mu x_i + \\mu^2) \\\\\n",
      "&= \\sum_i x_i^2 - 2 \\mu \\sum_i x_i + n \\mu^2 \\\\\n",
      "&= \\sum_i x_i^2 - 2n\\mu^2 + n \\mu^2 \\\\\n",
      "&= \\sum_i x_i^2 - n \\mu^2 \\\\\n",
      "&= \\sum_i x_i^2 -  \\frac{1}{n}\\left(\\sum_i x_i\\right)^2.\n",
      "\\end{aligned}\n",
      "\\end{equation}"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make use of this, we use one loop to compute both the sum of the numbers and the sum of their squares,\n",
      "and then use these two sums to compute the standard deviation."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "def find_sd2(lst) :\n",
      "    \"\"\"Given a list, this function computes the (biased) standard deviation more efficiently.\"\"\"\n",
      "    \n",
      "    # We need a variable to store the sum, and another one to store the sum of squares.\n",
      "    sum = 0.0\n",
      "    sum_sq = 0.0\n",
      "    \n",
      "    # Loop over the data to find the sum and the sum of squares.\n",
      "    for no in lst :\n",
      "        sum += no\n",
      "        sum_sq += no ** 2\n",
      "        \n",
      "    # To compute s.d., and the sum of squares of deviations,  we also need to know the number \n",
      "    # of data points:\n",
      "    n = len(lst)\n",
      "\n",
      "    # Using this compute the sum of squares of deviations\n",
      "    sum_sq_dev = sum_sq - sum**2 / n\n",
      "        \n",
      "       \n",
      "    # Now to finish computing sd, we just need to divide by n and take square root.\n",
      "    sd = sqrt(sum_sq_dev / n)\n",
      "    \n",
      "    # Don't ever forget to return your hard work.\n",
      "    return sd\n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us try to use it :"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"\"\"The standard deviations (using the second function) for the two exams are\n",
      "    %6.2f                 ,               %6.2f\n",
      "respectively.\"\"\" % (find_sd2(data_mid), find_sd2(data_end))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "The standard deviations (using the second function) for the two exams are\n",
        "     23.09                 ,                22.93\n",
        "respectively.\n"
       ]
      }
     ],
     "prompt_number": 8
    },
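    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The identity used by the one-loop version can also be checked numerically. The sketch below is only an\n",
      "illustration: the function names `sum_sq_dev_two_pass` and `sum_sq_dev_one_pass` and the list `sample` are\n",
      "assumptions made for this example. Both functions compute $\\sum_i (x_i - \\mu)^2$, the first directly from the\n",
      "definition and the second from the simplified form; the two results should agree up to floating-point rounding."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def sum_sq_dev_two_pass(lst) :\n",
      "    \"\"\"Sum of squared deviations from the mean, straight from the definition.\"\"\"\n",
      "    # First pass : the mean.\n",
      "    mu = sum(lst) / float(len(lst))\n",
      "    # Second pass : accumulate the squared deviations.\n",
      "    total_sq_dev = 0.0\n",
      "    for x in lst :\n",
      "        total_sq_dev += (x - mu) ** 2\n",
      "    return total_sq_dev\n",
      "\n",
      "def sum_sq_dev_one_pass(lst) :\n",
      "    \"\"\"The same quantity from the simplified formula, using a single loop.\"\"\"\n",
      "    total = 0.0\n",
      "    total_sq = 0.0\n",
      "    for x in lst :\n",
      "        total += x\n",
      "        total_sq += x ** 2\n",
      "    return total_sq - total ** 2 / len(lst)\n",
      "\n",
      "# Hypothetical sample with mean 5; the squared deviations are 9, 1, 1, 1, 0, 0, 4, 16.\n",
      "sample = [2, 4, 4, 4, 5, 5, 7, 9]\n",
      "print(\"Two-pass : %6.2f , one-pass : %6.2f\" % (sum_sq_dev_two_pass(sample), sum_sq_dev_one_pass(sample)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },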
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Correlation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Things seem to be better. But did the people who scored high in the first exam score high in the second too?\n",
      "To answer that, there is a measure called correlation. It is computed from two sets of data and \n",
      "gives a value between -1 and 1. The formula is\n",
      "\n",
      "\\begin{equation}\n",
      "\\text{Correlation} = \\frac{\\text{Covariance}}{(\\text{s.d. of } X)(\\text{s.d. of } Y)}\n",
      "\\end{equation}\n",
      "\n",
      "where\n",
      "\n",
      "\\begin{equation}\n",
      "\\text{Covariance} = \\frac{1}{n} \\sum_{i=1}^n (x_i - m_x)(y_i - m_y)\n",
      "\\end{equation}\n",
      "\n",
      "Here $x_i$ and $y_i$ are the given data, $m_x$ and $m_y$ are the means of $x$ and $y$ respectively, and $n$ is the number of data pairs."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As before, we simplify the formula so that we can compute it using just one loop.\n",
      "\n",
      "\\begin{equation}\\sum_i (x_i - m_x)(y_i - m_y) = \\sum_i x_i y_i - m_x \\sum_i y_i - m_y \\sum_i x_i + n m_x m_y \n",
      "= \\sum_i x_i y_i - n m_x m_y - n m_x m_y + n m_x m_y\\end{equation}\n",
      "\n",
      "\\begin{equation}= \\sum_i x_i y_i - \\frac{1}{n} \\left(\\sum_i x_i\\right)\\left(\\sum_i y_i\\right).\\end{equation}"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It makes sense to write the correlation function for a list of pairs. We can use that on our data using zip."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def my_corr(lst_of_2_tuples) :\n",
      "    \"\"\"Given a list of 2-tuples, this function computes the correlation between the first entries and the\n",
      "    second entries.\"\"\"\n",
      "    \n",
      "    # As before we use a huge bunch of variables.\n",
      "    sumx = 0.0\n",
      "    sumy = 0.0\n",
      "    sumxy = 0.0\n",
      "    sumx2 = 0.0\n",
      "    sumy2 = 0.0\n",
      "    \n",
      "    # Now loop\n",
      "    for pair in lst_of_2_tuples :\n",
      "        # To make reading easier, set\n",
      "        x = pair[0]\n",
      "        y = pair[1]\n",
      "        \n",
      "        # Now accumulate\n",
      "        sumx += x\n",
      "        sumy += y\n",
      "        sumxy += x * y\n",
      "        sumx2 += x * x\n",
      "        sumy2 += y * y\n",
      "    \n",
      "    # Now we have all the ingredients to compute covariance and s.d. except n :\n",
      "    n = len(lst_of_2_tuples)\n",
      "    \n",
      "    # Now compute. The quantities below are n times the covariance and sqrt(n) times each s.d.;\n",
      "    # these extra factors cancel in the ratio, so we do not divide by n.\n",
      "    covariance = sumxy - sumx * sumy / n\n",
      "    sdx = sqrt(sumx2 - sumx**2/n)\n",
      "    sdy = sqrt(sumy2 - sumy**2/n)\n",
      "    \n",
      "    correlation = covariance / (sdx * sdy)\n",
      "    \n",
      "    return correlation"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us try this"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Correlation is %6.4f.\" % (my_corr(zip(data_mid, data_end)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Correlation is 0.8675.\n"
       ]
      }
     ],
     "prompt_number": 10
    },
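    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To cross-check a one-loop implementation, the correlation can also be computed directly from the definition,\n",
      "at the cost of more passes over the data. The sketch below is an illustration under stated assumptions: the\n",
      "name `corr_from_definition` and the lists `xs`, `ys_up` and `ys_down` are made up for this example. For data lying\n",
      "exactly on an increasing straight line the correlation should come out as 1, and on a decreasing line as -1."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from math import sqrt\n",
      "\n",
      "def corr_from_definition(xs, ys) :\n",
      "    \"\"\"Correlation of two equally long lists, computed straight from the definition.\"\"\"\n",
      "    n = len(xs)\n",
      "    mx = sum(xs) / float(n)\n",
      "    my = sum(ys) / float(n)\n",
      "\n",
      "    cov = 0.0\n",
      "    var_x = 0.0\n",
      "    var_y = 0.0\n",
      "    for x, y in zip(xs, ys) :\n",
      "        cov += (x - mx) * (y - my)\n",
      "        var_x += (x - mx) ** 2\n",
      "        var_y += (y - my) ** 2\n",
      "\n",
      "    # The factors 1/n in the covariance and in the two variances cancel in the ratio.\n",
      "    return cov / sqrt(var_x * var_y)\n",
      "\n",
      "# Hypothetical data : ys_up = 2 * x + 1 lies on an increasing line, ys_down on a decreasing one.\n",
      "xs = [1, 2, 3, 4]\n",
      "ys_up = [3, 5, 7, 9]\n",
      "ys_down = [9, 7, 5, 3]\n",
      "print(\"Increasing line : %7.4f\" % corr_from_definition(xs, ys_up))\n",
      "print(\"Decreasing line : %7.4f\" % corr_from_definition(xs, ys_down))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },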
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Some general remarks"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Soon we shall learn how to read data from a file (and, if time permits, from a webpage)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try this website : http://people.csail.mit.edu/pgbovine/python/tutor.html ."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "List comprehension"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "(Ref Pg. 63) Syntax : `new_list = [f(e) for e in some_other_list]`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "list_of_first_100_odds = [(2*n + 1) for n in range(100)]\n",
      "print list_of_first_100_odds"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 169, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 193, 195, 197, 199]\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "random_list1 = [3, 6, 1]\n",
      "random_list2 = [40, 70]\n",
      "sum_list = [[(i + j) for i in random_list1] for j in random_list2]\n",
      "print sum_list"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[[43, 46, 41], [73, 76, 71]]\n"
       ]
      }
     ],
     "prompt_number": 12
    },
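    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A nested list comprehension can be unrolled into ordinary loops, which may make the order of evaluation easier\n",
      "to see. The sketch below is an illustration only; the names `list_a`, `list_b`, `sums_by_comprehension` and\n",
      "`sums_by_loops` are made up for this example. The outer `for` of the comprehension becomes the outer loop, and\n",
      "each inner comprehension produces one row."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "list_a = [3, 6, 1]\n",
      "list_b = [40, 70]\n",
      "\n",
      "# Nested comprehension : one inner list (row) for every j in list_b.\n",
      "sums_by_comprehension = [[(i + j) for i in list_a] for j in list_b]\n",
      "\n",
      "# The same computation written with explicit loops.\n",
      "sums_by_loops = []\n",
      "for j in list_b :\n",
      "    row = []\n",
      "    for i in list_a :\n",
      "        row.append(i + j)\n",
      "    sums_by_loops.append(row)\n",
      "\n",
      "# Both constructions should give the same list of rows.\n",
      "print(sums_by_comprehension == sums_by_loops)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },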
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can also traverse a list of pairs by unpacking each pair in the loop, as follows"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "marks = zip(data_mid, data_end)\n",
      "print \"Marks : \",\n",
      "print marks\n",
      "print \"-\"*70\n",
      "print \"Mid\\tEnd\"\n",
      "for m, e in marks :\n",
      "    print \"%5.1f\\t%5.1f\" % (m, e)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Marks :  [(23, 45), (45, 44), (83, 95), (90, 87), (12, 24), (87, 100), (67, 45), (69, 70), (74, 66), (36, 32), (43, 50), (69, 55), (66, 80), (70, 81)]\n",
        "----------------------------------------------------------------------\n",
        "Mid\tEnd\n",
        " 23.0\t 45.0\n",
        " 45.0\t 44.0\n",
        " 83.0\t 95.0\n",
        " 90.0\t 87.0\n",
        " 12.0\t 24.0\n",
        " 87.0\t100.0\n",
        " 67.0\t 45.0\n",
        " 69.0\t 70.0\n",
        " 74.0\t 66.0\n",
        " 36.0\t 32.0\n",
        " 43.0\t 50.0\n",
        " 69.0\t 55.0\n",
        " 66.0\t 80.0\n",
        " 70.0\t 81.0\n"
       ]
      }
     ],
     "prompt_number": 17
    }
   ],
   "metadata": {}
  }
 ]
}