Hypothesis testing: A/B testing
In [1]:
# Mission 1
# Size: 294,478 rows x 5 columns
# a) Using the data in the file ab_data.csv, find out whether the differences are significant, with and without the Fisher correction.
# b) Incrementally calculate the p-value, displaying its evolution.
%config InlineBackend.figure_format = 'retina'
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", font_scale=1.9, palette="tab10")
In [2]:
# First part of the exercise
# First we load the dataset and take a look at it
abdf = pd.read_csv('ab_data.csv')
abdf.head()
abdf.describe()
Out[2]:
   user_id                   timestamp      group landing_page  converted
0   851104  2017-01-21 22:11:48.556739    control     old_page          0
1   804228  2017-01-12 08:01:45.159739    control     old_page          0
2   661590  2017-01-11 16:55:06.154213  treatment     new_page          0
3   853541  2017-01-08 18:28:03.143765  treatment     new_page          0
4   864975  2017-01-21 01:52:26.210827    control     old_page          1
Out[2]:
             user_id      converted
count  294478.000000  294478.000000
mean   787974.124733       0.119659
std     91210.823776       0.324563
min    630000.000000       0.000000
25%    709032.250000       0.000000
50%    787933.500000       0.000000
75%    866911.750000       0.000000
max    945999.000000       1.000000
In [3]:
# %%timeit
# Next we look at the counts group by group
# and build a new dataframe holding only the counts.
# chi2_contingency works on a plain array of counts, so we pass T.values
abdf[['group', 'converted']].groupby('group').count().rename(columns={'converted': 'count'})
T = abdf[abdf['converted'] == 0][['group', 'converted']].groupby('group').count().rename(columns={'converted': 'noconv'})
T['conv'] = abdf[abdf['converted'] == 1][['group', 'converted']].groupby('group').count()
T
T.values
Out[3]:
            count
group
control    147202
treatment  147276
Out[3]:
           noconv   conv
group
control    129479  17723
treatment  129762  17514
Out[3]:
array([[129479,  17723],
       [129762,  17514]])
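As a complement to the groupby approach, the same kind of group-by-outcome table can be sketched with `pd.crosstab`. The tiny DataFrame `toy` below is hypothetical and only mirrors the `group` and `converted` column names; it is not the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [0, 0, 1, 0, 1, 1, 0, 0],
})

# Contingency table of counts: rows = group, columns = converted (0 / 1)
counts = pd.crosstab(toy["group"], toy["converted"])

# Row-normalised table: share of converted users within each group
rates = pd.crosstab(toy["group"], toy["converted"], normalize="index")

print(counts)
print(rates)
```

Applied to the counts reported in this notebook, the conversion rates are roughly 12.04% for control (17723 / 147202) and 11.89% for treatment (17514 / 147276), which is the difference the test is asked to judge.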
In [4]:
abdf[['group', 'converted']].groupby('group').count().rename(columns={'converted': 'count'})
T = abdf.query('converted == 0')[['group', 'converted']].groupby('group').count().rename(columns={'converted': 'noconv'})
T['conv'] = abdf.query('converted == 1')[['group', 'converted']].groupby('group').count()
T
T.values
Out[4]:
            count
group
control    147202
treatment  147276
Out[4]:
           noconv   conv
group
control    129479  17723
treatment  129762  17514
Out[4]:
array([[129479,  17723],
       [129762,  17514]])
In [7]:
# %%timeit
Tp = abdf.pivot_table(index='group', columns='converted', values='timestamp', aggfunc='count')
Tp
Tp.values
Out[7]:
converted       0      1
group
control    129479  17723
treatment  129762  17514
Out[7]:
array([[129479,  17723],
       [129762,  17514]])
In [9]:
# p-value not significant: we cannot reject the null hypothesis
# that the two groups share the same conversion rate
# (the fourth value returned by chi2_contingency holds the expected frequencies)
c2, p, df, observed = chi2_contingency(T.values)
print("Chi-square test statistic {0:.4f} p-value {1:.3f} degrees of freedom {2:3d}".format(c2, p, df))
Chi-square test statistic 1.5160 p-value 0.218 degrees of freedom   1
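The exercise asks for the result with and without the correction. A minimal sketch of that comparison follows, using the counts reported in this notebook. Note that the `correction` flag of `chi2_contingency` applies Yates' continuity correction, while Fisher's exact test is a separate function (`fisher_exact`); which of the two the exercise means by "Fisher correction" is an assumption, so both are shown.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Counts from the notebook: rows = control, treatment;
# columns = not converted, converted
table = np.array([[129479, 17723],
                  [129762, 17514]])

# Chi-square test with Yates' continuity correction (the scipy default)
stat_corr, p_corr, dof, expected = chi2_contingency(table, correction=True)

# Same test without the continuity correction
stat_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

# Fisher's exact test on the same 2x2 table (two-sided by default)
odds_ratio, p_fisher = fisher_exact(table)

print("with correction:    chi2 = {0:.4f}, p = {1:.3f}".format(stat_corr, p_corr))
print("without correction: chi2 = {0:.4f}, p = {1:.3f}".format(stat_raw, p_raw))
print("Fisher exact test:  p = {0:.3f}".format(p_fisher))
```

With samples this large the continuity correction changes the statistic only slightly, so the corrected and uncorrected tests should lead to the same conclusion.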
In [11]:
# Second part: evolution of the p-value
N = abdf['converted'].count()
pvalues = np.empty(N - 20)
T = np.zeros((2, 2)).astype(np.float32)
for i, row in abdf.iterrows():
    r = 0 if row['group'] == 'control' else 1
    c = 0 if row['converted'] == 0 else 1
    T[r, c] += 1
    if i > 20:
        # let's calculate the p-value and put it in a table
        # (pvalues[0] is never assigned, so it keeps whatever np.empty left there)
        c2, pvalues[i - 20], df, observed = chi2_contingency(T)
T
pvalues
Out[11]:
array([[129479.,  17723.],
       [129762.,  17514.]], dtype=float32)
Out[11]:
array([7.74860419e-304, 9.64699480e-001, 9.17074499e-001, ...,
       2.18231272e-001, 2.18485573e-001, 2.18231612e-001])
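Looping over every row with `iterrows` and calling `chi2_contingency` each time is slow on roughly 294,000 rows. As a sketch of an alternative, and not the notebook's own method, the running p-value can be computed in one vectorized pass. It relies on the closed form of the Yates-corrected statistic for a 2x2 table, chi2 = n * max(0, |ad - bc| - n/2)^2 / (r1 * r2 * c1 * c2), which should agree with the default behaviour of `chi2_contingency`. The function name `running_pvalues` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def running_pvalues(is_treatment, converted):
    """p-value of the Yates-corrected 2x2 chi-square test after each row."""
    g = np.asarray(is_treatment, dtype=bool)
    v = np.asarray(converted, dtype=bool)
    # Cumulative cell counts of the 2x2 table
    a = np.cumsum(~g & ~v).astype(float)   # control, not converted
    b = np.cumsum(~g & v).astype(float)    # control, converted
    c = np.cumsum(g & ~v).astype(float)    # treatment, not converted
    d = np.cumsum(g & v).astype(float)     # treatment, converted
    n = a + b + c + d
    # Product of the two row totals and the two column totals
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    with np.errstate(divide="ignore", invalid="ignore"):
        stat = n * np.maximum(0.0, np.abs(a * d - b * c) - n / 2) ** 2 / margins
    # The test is undefined while any row or column total is still zero
    stat[margins == 0] = np.nan
    return chi2.sf(stat, 1)

# Check against chi2_contingency on synthetic data (assumed, not the real file)
rng = np.random.default_rng(0)
g_demo = rng.integers(0, 2, size=1000)
v_demo = rng.random(1000) < 0.3
p_demo = running_pvalues(g_demo, v_demo)
```

On the notebook's data this would be called as `running_pvalues(abdf['group'] == 'treatment', abdf['converted'] == 1)`; early entries are NaN until every row and column of the table has at least one observation.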
In [12]:
plt.figure(figsize=(13, 11))
plt.title('p-values', fontsize=30)
sns.lineplot(data=[pvalues, np.ones(N - 20) * 0.05], linewidth=2.5, legend=False)
Out[12]: <Figure size 936x792 with 0 Axes>
Out[12]: Text(0.5, 1.0, 'p-values')
Out[12]: <matplotlib.axes._subplots.AxesSubplot at 0x7fafaac1b950>
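The 0.05 threshold can also be drawn with `axhline` rather than by building a constant array. The sketch below uses plain matplotlib; the helper name `plot_pvalues` and the axis labels are assumptions added for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pvalues(pvalues, alpha=0.05):
    """Plot the evolution of the p-value with a horizontal significance line."""
    fig, ax = plt.subplots(figsize=(13, 11))
    # x axis: position of each p-value in the sequence
    ax.plot(np.arange(1, len(pvalues) + 1), pvalues, linewidth=2.5)
    # Significance threshold as a horizontal line
    ax.axhline(alpha, color="tab:orange", linewidth=2.5)
    ax.set_title("p-values", fontsize=30)
    ax.set_xlabel("number of observations")
    ax.set_ylabel("p-value")
    return ax
```

Calling `plot_pvalues(pvalues)` on the array computed earlier in the notebook would give a figure comparable to the seaborn one, with the threshold kept separate from the data.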