# -*- coding: utf-8 -*-
# <nbformat>3.0</nbformat>

# <headingcell level=1>

# Correlations, sorting, file input/output

# <headingcell level=3>

# Correlations

# <markdowncell>

# Given data consisting of pairs of measurements (two variables), correlation gives a statistical measure of the relationship between the two variables. Standard examples are height and weight, test scores, etc.
# Let us denote the data by $(x_i, y_i)$, where $i$ runs from 1 to $N$. If the variables are linearly related, the plot of the points $(x_i, y_i)$ lies on a straight line. We want points that more or less lie on a straight line to have correlation close to 1 or -1, depending on whether the slope is positive or negative, and if the data is random, the correlation should be close to zero.
# The points $(x_i, y_i)$ lying on a (non-vertical) line means that the points $(x_i - \mu_x, y_i - \mu_y)$ lie on a line passing through the origin. In other words, the vector $(y_1 - \mu_y, \ldots, y_N-\mu_y)$ is a multiple of the vector $(x_1 - \mu_x, \ldots, x_N - \mu_x)$. The Cauchy-Schwarz inequality for $\mathbb{R}^N$ then tells us that $\sum_i (x_i - \mu_x)(y_i - \mu_y) / \sqrt{\left(\sum_i (x_i -\mu_x)^2\right) \left(\sum_i (y_i - \mu_y)^2\right)}$ lies between -1 and 1, and it attains -1 or 1 only when the vectors are multiples of each other. This motivates the following definition of correlation.

# <markdowncell>

# \begin{equation}
# \text{Correlation} (X_i, Y_i) = \frac{\text{Covariance}(X_i, Y_i)}{\text{s.d.}(X_i)\text{s.d.}(Y_i)}
# \end{equation}
# 
# where
# 
# \begin{equation}
# \text{Covariance}(X_i, Y_i) = \frac{1}{N} \sum_{i=1}^N (X_i - \mu_X)(Y_i - \mu_Y),
# \end{equation}
# 
# $\mu_X$ being the mean of the $X_i$'s and $\mu_Y$ being the mean of the $Y_i$'s.
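
# <markdowncell>

# The definition can be transcribed almost literally into code. The following is only a minimal sketch: the name corr_from_definition is made up for illustration, and it assumes that the two variables arrive as separate lists of equal, nonzero length and that neither list is constant. It makes one pass over the data to find the means and further passes for the sums. For points lying exactly on a line the result should come out close to 1 or -1, up to rounding.

# <codecell>

from math import sqrt

def corr_from_definition(xs, ys) :
    """Correlation computed directly from the definition (hypothetical helper)."""
    n = float(len(xs))
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    
    # Covariance and standard deviations, each with the 1/N factor as in the formulas.
    covariance = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mu_x)**2 for x in xs) / n)
    sd_y = sqrt(sum((y - mu_y)**2 for y in ys) / n)
    
    return covariance / (sd_x * sd_y)

# Points on a line of positive slope (y = 2x + 3) and of negative slope (y = -x).
rising = corr_from_definition([1, 3, 10, -2, 0], [5, 9, 23, -1, 3])
falling = corr_from_definition([1, 2, 3, 4], [-1, -2, -3, -4])
print("rising: %.4f, falling: %.4f" % (rising, falling))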

# <markdowncell>

# As before, we try to simplify the formula so that it can be computed using just one loop.
# 
# \begin{equation}\sum_i (X_i - \mu_X)(Y_i -\mu_Y) = \sum_i X_i Y_i - \mu_X \sum_i Y_i - \mu_Y \sum_i X_i + N \mu_X \mu_Y 
# = \sum_i X_i Y_i - N \mu_X \mu_Y - N \mu_X \mu_Y + N \mu_X \mu_Y\end{equation}
# 
# \begin{equation}= \sum_i X_i Y_i - \frac{1}{N} \left(\sum_i X_i\right)\left(\sum_i Y_i\right).\end{equation}
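
# <markdowncell>

# A quick numerical check that the sum of the centered products equals the sum of the products $X_i Y_i$ minus $\frac{1}{N}$ times the product of the two plain sums. The data set below is made up and chosen only for illustration.

# <codecell>

xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 6.0]
n = len(xs)
mu_x = sum(xs) / n
mu_y = sum(ys) / n

# Left-hand side: needs the means first, hence two passes over the data.
lhs = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))

# Right-hand side: only running sums, so a single pass is enough.
rhs = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

print("lhs = %.6f, rhs = %.6f" % (lhs, rhs))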

# <markdowncell>

# Let us try this out

# <codecell>

# For finding standard deviation, we need
from math import sqrt

# <codecell>

def my_corr(lst_of_2_tuples) :
    """Given a list of 2-tuples, this function computes the correlation between the first entries and the
    second entries."""
    
    # As before we use a huge bunch of variables.
    sumx = 0.0
    sumy = 0.0
    sumxy = 0.0
    sumx2 = 0.0
    sumy2 = 0.0
    
    # Now loop
    for (x, y) in lst_of_2_tuples :
        
        # Now accumulate
        sumx += x
        sumy += y
        sumxy += x * y
        sumx2 += x * x
        sumy2 += y * y
    
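    # Now we have all the ingredients to compute covariance and s.d. except n :
    n = len(lst_of_2_tuples)
    
    # Now compute. These are the covariance and the s.d.'s scaled by n and sqrt(n)
    # respectively; the scale factors cancel in the correlation ratio below.
    covariance = sumxy - sumx * sumy / n
    sdx = sqrt(sumx2 - sumx**2/n)
    sdy = sqrt(sumy2 - sumy**2/n)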
    
    if sdx == 0 or sdy == 0 :
        print "\nError: Correlation: One of the variables is constant. Cannot compute correlation."
        correlation = None
    else :
        correlation = covariance / (sdx * sdy)
    
    return correlation

# <codecell>

data_mid = [23, 45, 83, 90, 12, 87, 67, 69, 74, 36, 43, 69, 66, 70]
data_end = [45, 44, 95, 87, 24, 100, 45, 70, 66, 32, 50, 55, 80, 81]

zipped_data = zip(data_mid, data_end)
print "Zipped data : ", zipped_data

# <codecell>

print "Correlation is", my_corr(zipped_data)

# <markdowncell>

# We can experiment

# <codecell>

def test_corr(lst_of_2_tups) :
    print "Correlation of", lst_of_2_tups, "is", my_corr(lst_of_2_tups)

# <codecell>

test_corr([(1, 5), (3, 9), (10, 23), (-2, -1), (0, 3)])
test_corr([(1, 0), (0, 1), (1, 1), (0, 0)])
test_corr([(1, 0), (0, 1), (1, 1)])
test_corr([(x, 1) for x in range(5)])
test_corr([(x**2, x) for x in range(0, 100)])

# <headingcell level=3>

# Sorting

# <markdowncell>

# Given a list of numbers (or any list of sortable elements), we can sort them using the following simple algorithm.

# <markdowncell>

# Start at the beginning of the list. Compare the adjacent entries. If they are in the wrong order, swap them. Advance by one place. Repeat until nothing is swapped in one full sweep.

# <codecell>

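def horrible_sort(somelist, showstep=False) :
    """Sort somelist in place by repeatedly swapping adjacent out-of-order entries,
    and return it. If showstep is True, print the list after every comparison."""
    swapped_during_pass = True
    while swapped_during_pass :
        swapped_during_pass = False
        for i in range(len(somelist) - 1) :
            if somelist[i] > somelist[i+1] :
                # Swap the two adjacent entries.
                somelist[i], somelist[i+1] = somelist[i+1], somelist[i]
                swapped_during_pass = True
            if showstep :
                print somelist
    return somelist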

# <codecell>

print horrible_sort([3,1,4,2,5,0])
print horrible_sort([1,3,1,3,1,3,1], True)
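
# <markdowncell>

# One way to gain some confidence in this algorithm is to compare its output with Python's built-in sorted() on random lists. The sketch below is self-contained, so it restates the same swapping loop under the made-up name sweep_sort; unlike horrible_sort, it works on a copy and leaves its argument unchanged. The seed, the list sizes and the value range are arbitrary choices.

# <codecell>

import random

def sweep_sort(somelist) :
    """Return a sorted copy of somelist using repeated adjacent swaps (hypothetical helper)."""
    result = list(somelist)
    swapped_during_pass = True
    while swapped_during_pass :
        swapped_during_pass = False
        for i in range(len(result) - 1) :
            if result[i] > result[i+1] :
                result[i], result[i+1] = result[i+1], result[i]
                swapped_during_pass = True
    return result

random.seed(0)
all_agree = True
for trial in range(100) :
    # A random list of between 0 and 20 integers.
    sample = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    if sweep_sort(sample) != sorted(sample) :
        all_agree = False
print("Agrees with sorted() on every trial: %s" % all_agree)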

# <headingcell level=3>

# Reading from files

# <codecell>

data_file = open("files/01_27_data.txt", "r")
for line in data_file :
    print line
data_file.close()

# <markdowncell>

# Now we can extract the data using the split() function as follows:

# <codecell>

data_file = open("files/01_27_data.txt", 'r')
for line in data_file :
    print line.split()
data_file.close()    
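
# <markdowncell>

# The contents of the data file are not reproduced here, so the effect of split() may be easier to see on made-up lines. The header and the numbers below are purely hypothetical and are not taken from the actual file.

# <codecell>

sample_header = "Temperature   Sales\n"
sample_row = "23.5\t410\n"

# With no argument, split() breaks the string at any run of whitespace
# (spaces, tabs, newlines), so the trailing newline does not end up in the pieces.
header_fields = sample_header.split()
row_fields = sample_row.split()
print(header_fields)
print(row_fields)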

# <markdowncell>

# However, the entries are strings and the first line has to be discarded. We do this as follows: i keeps track of which line we are on. If it is not the first line, we convert the strings into floats and store them.

# <codecell>

data_file = open("files/01_27_data.txt", 'r')
i = 0
ice_cream_data = []
for line in data_file :
    if i > 0 :
        ice_cream_data.append((float(line.split()[0]), float(line.split()[1])))
    i += 1
data_file.close()
print ice_cream_data
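
# <markdowncell>

# The same parsing can be tried without the data file by iterating over a list of lines, since looping over a file object yields its lines one at a time in the same way. This is only a sketch: the lines below are made up and do not come from the actual file, and parse_pairs is a hypothetical helper name. It uses enumerate() instead of a hand-maintained counter and splits each line only once.

# <codecell>

def parse_pairs(lines) :
    """Skip the first (header) line and return the rest as a list of float 2-tuples."""
    pairs = []
    for i, line in enumerate(lines) :
        if i > 0 :
            fields = line.split()
            pairs.append((float(fields[0]), float(fields[1])))
    return pairs

# Made-up stand-in for the lines of a data file with one header line.
made_up_lines = ["Temperature Sales\n", "20.0 215\n", "25.5 325\n", "30.0 412\n"]
parsed = parse_pairs(made_up_lines)
print(parsed)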

# <markdowncell>

# Okay! Now that we have a list of tuples, we can find the correlation!

# <codecell>

data_file = open("files/01_27_data.txt", 'r')
i = 0
ice_cream_data = []
for line in data_file :
    if i > 0 :
        ice_cream_data.append((float(line.split()[0]), float(line.split()[1])))
    i += 1
data_file.close()
print "Correlation for the icecream data is %6.4f" % my_corr(ice_cream_data)