Python statistics basics
First cell
import numpy as np
import pandas as pd
import scipy.stats as ss
import statsmodels.stats.weightstats as smw
import statsmodels.stats.proportion as smp
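To import a CSV file from your computer (Google Colab)
from google.colab import files # gives access to Colab's file-upload helper
import io
uploaded = files.upload() # this creates a button you click to browse for the file; once it is uploaded, the file contents are stored in uploaded under the file name
df = pd.read_csv(io.BytesIO(uploaded["gifted.csv"])) # the file name must be written as a string, in quotes
df.head() # this is simply to check the first five rows and make sure the file has been uploaded correctly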
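The upload step only works inside Google Colab, so here is a minimal sketch of the same read_csv pattern that runs anywhere. It assumes that files.upload() returns a dictionary mapping each file name to its raw bytes; the file name example.csv and its contents are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the result of files.upload():
# a dict of {file name: raw bytes}. The data are invented.
uploaded = {"example.csv": b"name,score\nAna,90\nBen,85\n"}

# io.BytesIO wraps the bytes so that read_csv can treat them like a file.
df_demo = pd.read_csv(io.BytesIO(uploaded["example.csv"]))
print(df_demo.head())
```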
Groupby
Groupby splits the rows by the distinct values of one column (or several) and returns an aggregate, such as a count or a sum, of the same or another column. Examples:
table = df.groupby("Survived")["Survived"].count()
# this creates an object table holding the result of grouping by the column Survived and counting how many observations there are in each category. As Survived only has two values, it tells you how many people survived and how many didn't.
table = df.groupby("Pclass")["Survived"].sum()
# this creates an object table holding the result of grouping by the column Pclass (there were 3 classes: first class (1), second class (2) and third class (3)) and returning the sum of another column, Survived, within each group. This is the sum of the cell values, not a count of rows; since those who survived are coded as 1, in effect it tells you how many survived in each class.
table = df.groupby(["Pclass", "Survived"])["Survived"].count()
# this creates an object table holding the result of grouping by Pclass and then, within each class, by Survived (first group 1st class and not survived, second group 1st class and survived, etc.) and returning the number of observations in each subgroup.
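To see what the three groupby calls return, here is a small sketch on a made-up six-row table. The rows are invented for illustration and are not the real data set; only the column names Pclass and Survived come from the notes.

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Number of rows for each value of Survived.
by_survival = toy.groupby("Survived")["Survived"].count()

# Sum of Survived within each class (survivors are coded 1).
by_class = toy.groupby("Pclass")["Survived"].sum()

# Number of rows for each (Pclass, Survived) combination.
by_both = toy.groupby(["Pclass", "Survived"])["Survived"].count()

print(by_survival, by_class, by_both, sep="\n\n")
```

Each result is a pandas Series indexed by the group labels, so a single group can be looked up with .loc, for example by_both.loc[(3, 0)].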
Creating a new column from another column
df["WithSib"] = np.where(df["SibSp"] == 0, 0, 1)
# This creates a new column WithSib by looking at the column SibSp: if the content of SibSp in a row is 0, it assigns a value of 0 to the cell in the new column, and if the content of SibSp is different from 0, it assigns a value of 1.
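A minimal sketch of the same np.where pattern on an invented four-row column, to show the row-by-row result. The values in SibSp are made up.

```python
import numpy as np
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame({"SibSp": [0, 1, 3, 0]})

# np.where(condition, value_if_true, value_if_false) works element by element.
toy["WithSib"] = np.where(toy["SibSp"] == 0, 0, 1)
print(toy)
```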
Creating a Confidence Interval from a set of statistics
alpha = 0.05 # it creates an object and assigns it a value of 0.05
xbar = 29 # it creates an object and assigns it a value of 29 (Note: for large numbers, for example 27500, make sure you DO NOT add a comma separating the thousands. Writing 27,500 creates a tuple in which xbar has two values, one of 27 and another of 500)
s= 4.5 # it creates an object and assigns it a value of 4.5
n= 27 # it creates an object and assigns it a value of 27
SE = s/np.sqrt(n)# it calculates the Standard Error based on previous values
Calculating the t value for a t distribution
tStar = ss.t.ppf(q=1-alpha/2, df=n-1) # it creates an object tStar and places in it (with the method ss.t.ppf) the t score for a two-tailed interval, given the alpha and degrees of freedom defined in the lines before
Alternatively (for large samples where n>30)
Calculating the z value for a normal distribution
alpha = 0.05 # it creates an object and assigns it a value of 0.05
zStar = ss.norm.ppf(1-alpha/2, 0, 1) # it creates an object zStar and places in it the z value for a two-tailed standard normal distribution (mean 0, standard deviation 1). It is two-tailed, which is why alpha is divided by 2: you have .025 on each end.
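The following sketch compares the t and z critical values for a few sample sizes, to illustrate why the z value is a reasonable substitute for large n. The sample sizes in the loop are arbitrary choices, not values from these notes (apart from n = 27).

```python
import scipy.stats as ss

alpha = 0.05
z_star = ss.norm.ppf(1 - alpha / 2)  # about 1.96

# The t critical value approaches the z critical value as n grows.
for n in (10, 27, 100, 1000):
    t_star = ss.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n={n:5d}  tStar={t_star:.3f}  zStar={z_star:.3f}")
```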
Additional use of ss.norm.ppf
The method above can also be used to find the value that sits at a given percentile of a normal distribution (ss.t.ppf does the same for a t distribution). For example, in a test in which only the top 12% is accepted to a school, and the average score is 90 with a standard deviation of 13, we want to know the score that would get you admitted:
Xvalue = ss.norm.ppf(.88, 90, 13) # .88 = 1-.12; 90 is the mean and 13 the standard deviation
Returns a value of 105.3
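As a quick check of the admission example (mean 90 and standard deviation 13, as stated in the text), the sketch below feeds the cutoff back into ss.norm.cdf, which is the inverse of ppf, and should recover 0.88.

```python
import scipy.stats as ss

# Score at the 88th percentile of a normal distribution with mean 90, sd 13.
cutoff = ss.norm.ppf(0.88, loc=90, scale=13)

# cdf is the inverse of ppf: the share of scores at or below the cutoff.
share_below = ss.norm.cdf(cutoff, loc=90, scale=13)

print(round(cutoff, 1), round(share_below, 2))
```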
Calculating the probabilities of a value being left of a specific z value
pVal= ss.norm.cdf(z)
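So, for z = 1, ss.norm.cdf(1) returns about 0.841, the proportion of the standard normal distribution that lies at or below z = 1.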
This can be used for hypothesis testing using the formula.
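The formula referred to above is not shown in these notes. As a sketch only, one common way to turn a z statistic into a p-value with the normal distribution looks like this; the value z = 1.0 is just an example.

```python
import scipy.stats as ss

z = 1.0  # example z statistic, chosen for illustration

p_left = ss.norm.cdf(z)              # area to the left of z
p_right = ss.norm.sf(z)              # area to the right of z (1 - cdf)
p_two_sided = 2 * ss.norm.sf(abs(z)) # both tails, for a two-sided test

print(round(p_left, 4), round(p_right, 4), round(p_two_sided, 4))
```

Which of the three applies depends on the alternative hypothesis (left-tailed, right-tailed or two-sided).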
To calculate, manually, a confidence interval:
lower = xbar - tStar*SE # it calculates the lower limit of the interval
upper = xbar + tStar*SE # it calculates the upper limit of the interval
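Putting the pieces together, here is a self-contained sketch of the manual t interval using the values from these notes (xbar = 29, s = 4.5, n = 27). The cross-check with ss.t.interval is an addition, not part of the original steps.

```python
import numpy as np
import scipy.stats as ss

alpha = 0.05
xbar, s, n = 29, 4.5, 27

SE = s / np.sqrt(n)                         # standard error of the mean
tStar = ss.t.ppf(q=1 - alpha / 2, df=n - 1) # two-tailed critical value

lower = xbar - tStar * SE
upper = xbar + tStar * SE
print(round(lower, 2), round(upper, 2))

# Cross-check: scipy can build the same interval in one call.
check_lower, check_upper = ss.t.interval(1 - alpha, n - 1, loc=xbar, scale=SE)
```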
Creating an array of random numbers
x = ss.norm.rvs(loc=75,scale=15,size=50)
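# Generates random variables. In this case it draws 50 random numbers from a normal distribution with a mean of 75 (loc) and a standard deviation of 15 (scale), so most values fall roughly between 45 and 105, not within a fixed range. You can look at the actual stats (min, max, mean, etc.) of the generated numbers by writing ss.describe(x)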
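Because the numbers are random, every run gives a different sample. The sketch below passes random_state to make the draw repeatable; the seed value 42 is an arbitrary choice, not something from these notes.

```python
import numpy as np
import scipy.stats as ss

# random_state fixes the seed so the same 50 numbers come back each time.
x = ss.norm.rvs(loc=75, scale=15, size=50, random_state=42)
x_again = ss.norm.rvs(loc=75, scale=15, size=50, random_state=42)

summary = ss.describe(x)  # nobs, minmax, mean, variance, skewness, kurtosis
print(summary.nobs, summary.minmax, summary.mean)
```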
Creating the statistics for a data set
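res = smw.DescrStatsW(x)
# It creates statistics for a set of data and puts them in an object (res). Note that the prefix must match the import in the first cell (statsmodels.stats.weightstats was imported as smw).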
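A short sketch of what the res object holds. The sample and its seed are invented for illustration. With the default settings, res.std uses the population formula (dividing by n), while res.std_mean is the standard error based on the sample standard deviation.

```python
import numpy as np
import scipy.stats as ss
import statsmodels.stats.weightstats as smw

# Illustrative sample; the seed is an arbitrary choice.
x = ss.norm.rvs(loc=75, scale=15, size=50, random_state=0)
res = smw.DescrStatsW(x)

print(res.nobs)      # number of observations
print(res.mean)      # sample mean
print(res.std)       # standard deviation (divides by n by default)
print(res.std_mean)  # standard error of the mean
```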
Hypothesis testing
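tStat, pVal = res.ztest_mean(value=76, alternative="two-sided")
# It takes the statistics stored in res and returns the test statistic and the p-value for the hypothesised mean given in value (in this case 76). Because this is a z test, the first output is a z statistic, even though it is named tStat here.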
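To make clear what ztest_mean computes, this sketch reproduces its two outputs by hand on an invented sample (the seed is arbitrary): the z statistic is the distance between the sample mean and the hypothesised value in standard-error units, and the two-sided p-value is the area in both tails.

```python
import numpy as np
import scipy.stats as ss
import statsmodels.stats.weightstats as smw

x = ss.norm.rvs(loc=75, scale=15, size=50, random_state=0)  # illustrative data
res = smw.DescrStatsW(x)

z_stat, p_val = res.ztest_mean(value=76, alternative="two-sided")

# Manual version: z = (mean - hypothesised value) / standard error.
se = np.std(x, ddof=1) / np.sqrt(len(x))
z_manual = (np.mean(x) - 76) / se
p_manual = 2 * ss.norm.sf(abs(z_manual))

print(z_stat, p_val)
print(z_manual, p_manual)
```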
Confidence interval (with method)
lower, upper = res.zconfint_mean(alpha=0.05, alternative="two-sided")
# the method .zconfint_mean takes the statistics of res and, given an alpha value and whether the interval is one- or two-sided (it is usually two-sided), delivers two outputs: the first is the lower and the second the upper limit of a confidence interval.
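The sketch below checks zconfint_mean against the manual formula (mean plus or minus zStar times the standard error) on an invented sample with an arbitrary seed.

```python
import numpy as np
import scipy.stats as ss
import statsmodels.stats.weightstats as smw

alpha = 0.05
x = ss.norm.rvs(loc=75, scale=15, size=50, random_state=0)  # illustrative data
res = smw.DescrStatsW(x)

lower, upper = res.zconfint_mean(alpha=alpha, alternative="two-sided")

# Manual version of the same two-sided z interval.
se = np.std(x, ddof=1) / np.sqrt(len(x))
z_star = ss.norm.ppf(1 - alpha / 2)
lower_manual = np.mean(x) - z_star * se
upper_manual = np.mean(x) + z_star * se

print(lower, upper)
```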
Print method
print("\t Since p-value ({0:5.3f}) is greater than alpha ({1:5.2f}), we fail to reject H0".format(pVal, alpha))
#
\t : is equivalent to pressing tab; it inserts a tab character, which indents the text (it is not a fixed number of blanks).
({0:5.3f}) : it instructs to insert in that part of the text the value in the first position in the parentheses after format (in this case the value in object pVal), shown with 3 decimal places and padded with blanks to a minimum width of 5 characters. Longer numbers are not cut off.
({1:5.2f}) : it instructs to insert in that part of the text the value in the second position in the parentheses after format (in this case the value in object alpha), shown with 2 decimal places and padded with blanks to a minimum width of 5 characters.
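A small sketch of how the width and decimals in the format codes behave. The numbers are arbitrary examples.

```python
# Width 5, 3 decimals: "0.042" is exactly 5 characters, so no padding.
a = "{0:5.3f}".format(0.0421)

# Width 5, 2 decimals: "0.05" is 4 characters, so one blank is added in front.
b = "{0:5.2f}".format(0.05)

# The width is a minimum: a longer number is printed in full.
c = "{0:5.3f}".format(123.4567)

print(repr(a), repr(b), repr(c))
```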