Advanced Research Methods 2020: A guide to RStudio
Which test to use:

Tests of association (IV qualitative, DV quantitative):
- 2 variables, parametric: T-test
- 2 variables, non-parametric: Mann-Whitney U or Wilcoxon test
- 3+ variables, parametric: Anova
- 3+ variables, non-parametric: Kruskal test

Tests of correlation (IV & DV quantitative, 2 variables):
- Parametric: cor test
- Non-parametric: Spearman test

Cause-consequence relationship (IV & DV quantitative): Regression

IV & DV qualitative: Chi-square or Fisher's test
IMPORTING DATA
From excel:
In the upper right-hand quadrant, go to Environment > Import Dataset > From Excel.
When using formulas, identify your variables as (file-name$column-name)
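You can also import from the console; a minimal sketch using the readxl package (the same package RStudio's import dialogue uses), with a hypothetical file name:
>library(readxl)
>N_words<-read_excel("N_words.xlsx")   # hypothetical Excel file in the working directory
>str(N_words)                          # check that the columns were imported correctly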
By hand:
>variable1<-c(#,#,#,#,#,#)
>variable2<-c(#,#,#,#,#,#)

TESTING TWO VARIABLES FOR ASSOCIATION
2.1. Run the Shapiro test for normal distribution on each variable:
>shapiro.test(variable1)
> shapiro.test(variable2)
If p-value>0.05 for both variables, there is no significant difference between their distribution and the perfect normal curve. Proceed to variance test in 2.2.
If p-value<0.05 for either variable, there is a significant difference between its distribution and the perfect normal curve. The data is not parametric, so the T-test cannot be run. Proceed to the Mann-Whitney U or Wilcoxon test instead. See 2.4.

2.2. Run the variance test on both variables to check for equality of variances. Equality of variances means that the average distance of all results from the mean is not different when we compare the two variables.
>var.test(variable1,variable2)
If p-value>0.05, the ratio of variances is not significantly different from 1 and there is equality of variances. The data is parametric. Proceed to T-test in 2.3.
If p-value<0.05, the ratio of variances is significantly different from 1 and there is no equality of variances. The data is not parametric. Proceed to the Mann-Whitney U or Wilcoxon test. See 2.4.

2.3. If the data is parametric as shown in the Shapiro and variance tests, run the T-test.
REGULAR T-TEST
>t.test(variable1,variable2)
OR MATCHED T-TEST for paired data (when the same subject is tested twice, before and after some intervention)
>t.test(before,after,paired=TRUE)
If p-value>0.05, the association is not significant. You may need more data.
If p-value<0.05, the association is significant. Yay!
When p-value = …e-X, the decimal point is shifted X places to the left (e.g. 3e-5 = 0.00003), so the result is very significant.

2.4. If the data is NOT parametric as shown in the Shapiro and variance tests, run the Wilcoxon test. The Mann-Whitney U test is another option, but we didn't go over it in class.
FOR REGULAR DATA
>wilcox.test(variable1,variable2,paired=F)
(F=false)
FOR PAIRED DATA
>wilcox.test(variable1,variable2,paired=TRUE)
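A minimal worked sketch of the whole procedure, using made-up scores for two hypothetical groups:
>group_a<-c(12,15,14,10,13,11)   # hypothetical scores, group A
>group_b<-c(18,17,20,16,19,21)   # hypothetical scores, group B
>shapiro.test(group_a)           # normality check for each variable
>shapiro.test(group_b)
>var.test(group_a,group_b)       # equality of variances
>t.test(group_a,group_b)         # if both checks pass (parametric data)
>wilcox.test(group_a,group_b,paired=F)   # fallback if the data is not parametric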
TESTING THREE OR MORE VARIABLES FOR ASSOCIATION
3.1. First reformat your data into two columns in Excel. You will need to import both the original (storage) version and the reformatted version into RStudio.
STORAGE VERSION:
Number of words
Spain    UK     Russia
112      152    178
143      161    167
143      144    181
REFORMATTED VERSION:
No words    Nationality
112         Spain
143         Spain
143         Spain
152         UK
161         UK
144         UK
178         Russia
167         Russia
181         Russia
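If you prefer to build both versions directly in R instead of Excel, a minimal sketch that reproduces the tables above (check.names=FALSE keeps the space in the column name "No words"):
>N_words<-data.frame(Spain=c(112,143,143),UK=c(152,161,144),Russia=c(178,167,181))
>N_words_2<-data.frame("No words"=c(N_words$Spain,N_words$UK,N_words$Russia),Nationality=factor(rep(c("Spain","UK","Russia"),each=3)),check.names=FALSE)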
3.2. Run the Shapiro test with the original/storage version of the data to check for normal distribution:
>shapiro.test(independent-variable1)
>shapiro.test(independent-variable2)
>shapiro.test(independent-variable3)
For example:
>shapiro.test(N_words$Spain)
>shapiro.test(N_words$UK)
>shapiro.test(N_words$Russia)
If p-value>0.05 for all variables, there is no significant difference between their distribution and the perfect normal curve. Proceed to Bartlett test in 3.3.
If p-value<0.05 for any variable, there is a significant difference between its distribution and the perfect normal curve. The data is not parametric, so the Anova test cannot be run. Proceed to Kruskal test in 3.5.

3.3. Run the Bartlett test with the reformatted version of the data to check for equality of variances:
>bartlett.test(dependent-variable~independent-variable)
For example:
>bartlett.test(N_words_2$`No words`~N_words_2$Nationality)
If p-value>0.05, the variances are equal across groups and there is equality of variances. The data is parametric. Proceed to Anova test in 3.4.
If p-value<0.05, the variances are not equal across groups and there is no equality of variances. The data is not parametric. Proceed to Kruskal test in 3.5.

3.4. Run the Anova test with the reformatted version of the data to check for statistical significance of parametric data:
>name1<-lm(dependent-variable~independent-variable)*
>anova(name1)
For example:
>Res_anova<-lm(N_words_2$`No words`~N_words_2$Nationality)
>anova(Res_anova)
*Note: lm stands for linear model
When reporting data, we state the F value, the degrees of freedom, and the result Pr(>F)* given by the Anova test.
*Pr(>F) is the p-value. Its significance level is signaled with asterisks at the bottom of the Anova test results.
If you wish to do post-hoc tests, you may choose to do an analysis of variance (aov) instead of the linear model:
>name2<-aov(dependent-variable~independent-variable)
>anova(name2)
>name2
>summary(name2)
For example:
>Res_anova2<-aov(N_words_2$`No words`~N_words_2$Nationality)
>anova(Res_anova2)
>Res_anova2
>summary(Res_anova2)
The Tukey HSD post-hoc test shows the differences between each pair of groups. It tells us which groups differ significantly from one another and which do not.
>TukeyHSD(name2)
For example:
>TukeyHSD(Res_anova2)
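A runnable sketch of the whole parametric pipeline on the word-count data from the tables above (note that TukeyHSD needs the aov object, not the lm one):
>N_words_2<-data.frame("No words"=c(112,143,143,152,161,144,178,167,181),Nationality=factor(c("Spain","Spain","Spain","UK","UK","UK","Russia","Russia","Russia")),check.names=FALSE)
>bartlett.test(N_words_2$`No words`~N_words_2$Nationality)   # equality of variances
>Res_anova2<-aov(N_words_2$`No words`~N_words_2$Nationality) # analysis of variance
>anova(Res_anova2)                                           # F value, degrees of freedom, Pr(>F)
>TukeyHSD(Res_anova2)                                        # pairwise differences between nationalities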
3.5. Run the Kruskal test with the reformatted version of the data to check for statistical significance of non parametric data:
>kruskal.test(dependent-variable~independent-variable)
For example:
>kruskal.test(schools$mistakes~schools$school)
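Or, sticking with the word-count data (using N_words_2 as built in the sketch above):
>kruskal.test(N_words_2$`No words`~N_words_2$Nationality)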
If p-value>0.05, the association is not significant. You may need more data.
If p-value<0.05, the association is significant. Yay!

TESTING TWO VARIABLES FOR CORRELATION
Tests for correlation are often used in social sciences. They make no cause-effect assumption because we use them when no experiment or intervention has taken place. We simply select two variables and run tests to see if there is a relationship between the two (e.g. length of study abroad and language level). There is no control group or expected direct influence.

4.1. First import your dataset to RStudio and check out its structure and other basic information:
>str(Filename)
>name3<-data.frame(Filename)*
>name3**
*Nothing should happen here. If you don't capitalize properly, you'll get an error message.
**This will produce a table.
For example:
>str(Stay)
>relevance.stay<-data.frame(Stay)
>relevance.stay
4.2. Now run the Shapiro test to check for normal distribution.
>shapiro.test(variable1)
>shapiro.test(variable2)
For example:
>shapiro.test(Stay$length_of_stay)
>shapiro.test(Stay$marks)
If p-value>0.05 for both variables, there is no significant difference between their distribution and the perfect normal curve. The data is parametric. Proceed to cor test in 4.3.
If p-value<0.05 for either variable, there is a significant difference between its distribution and the perfect normal curve. The data is not parametric, so the cor test cannot be run. Proceed to Spearman test in 4.4.

4.3. If the data is parametric, run the cor test.
>cor.test(variable1,variable2)
For example:
>cor.test(Stay$length_of_stay,Stay$marks)
The cor estimate (the correlation coefficient) must be >0.7 to indicate a strong correlation.
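A minimal runnable sketch with made-up values (hypothetical months abroad and exam marks for six students):
>length_of_stay<-c(3,6,9,12,15,18)   # hypothetical months abroad
>marks<-c(5.5,6.0,6.5,7.0,8.0,8.5)   # hypothetical exam marks
>shapiro.test(length_of_stay)
>shapiro.test(marks)
>cor.test(length_of_stay,marks)      # parametric (Pearson) correlation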
4.4. If the data is not parametric, run the Spearman test.
>cor.test(~name3$variable1 + name3$variable2,data=name3,method="spearman",continuity=FALSE,conf.level=0.95)*
*The name will depend on how you named the data in 4.1.
For example:
>relation<-data.frame(Metaphors_adjectives)**
>relation
>shapiro.test(Metaphors_adjectives$metaphor)
>shapiro.test(Metaphors_adjectives$adjective)
>cor.test(~relation$metaphor + relation$adjective,data=relation,method="spearman",continuity=FALSE,conf.level=0.95)
The rho estimate must be >0.7 to indicate a strong correlation.
**This is the naming process from 4.1.
TESTING VARIABLES FOR A CAUSE-CONSEQUENCE RELATIONSHIP
To test variables for a cause-consequence relationship, we use regression.
>str(filename)
>name4=lm(dependent-variable~independent-variable)
>name4$coefficients
>name4$residuals
>lm(formula=dependent-variable~independent-variable)
>summary(name4)
For example:
>str(corpus_regr)
>modelX=lm(corpus_regr$X~corpus_regr$Y)
>modelX$coefficients
>modelX$residuals
>lm(formula=corpus_regr$X~corpus_regr$Y)
>summary(modelX)
Report the adjusted R-squared value and the p-value (Pr(>|t|), second row). The significance is signaled with asterisks at the bottom of the table generated by summary().
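A minimal runnable sketch with made-up data (hypothetical hours of instruction predicting test score):
>hours<-c(2,4,6,8,10,12)        # hypothetical independent variable
>score<-c(48,55,61,70,74,82)    # hypothetical dependent variable
>modelY<-lm(score~hours)        # dependent ~ independent
>modelY$coefficients            # intercept and slope
>summary(modelY)                # adjusted R-squared and Pr(>|t|)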
QUALITATIVE DATA: ASSESSING SIGNIFICANCE
When both the independent and dependent variables are qualitative, we devise a contingency table, in which each element can only occupy one cell. The numbers in each cell represent the number of cases (instances) of the given combination of variables.
This is one of the most frequent tests in linguistics because we normally analyze data in categories.
            Turn
            Beginning    Middle    End
Yes         13           2         1
No          2            13        14
To calculate significance, we can enter our data into the online calculator on Quantspy.org and run the Chi Square test, as long as:
- No cell has 0 cases
- Each cell has at least 5 cases
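You can also run the chi-square test directly in R instead of the online calculator; a minimal sketch entering the contingency table above as a matrix:
>turns<-matrix(c(13,2,1,2,13,14),nrow=2,byrow=TRUE,dimnames=list(c("Yes","No"),c("Beginning","Middle","End")))
>chisq.test(turns)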
If any cell has fewer than 5 cases, we can run a Fisher's test instead, as in this example from class:
            China    Spain    US    Canada    Iran    Slovenia
Male        0        2        0     1         0       0
Female      3        2        1     0         1       1
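This table can also be tested directly in R; a minimal sketch using fisher.test:
>gender_country<-matrix(c(0,2,0,1,0,0,3,2,1,0,1,1),nrow=2,byrow=TRUE,dimnames=list(c("Male","Female"),c("China","Spain","US","Canada","Iran","Slovenia")))
>fisher.test(gender_country)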
GRAPHING YOUR DATA
Graph the number of cases of individual variables in a histogram.
>hist(variable1,breaks=13,density=8,angle=45,col="color-name",main="title-of-histogram",xlab="x-axis-label",ylab="y-axis-label")
>hist(variable2,breaks=13,density=8,angle=45,col="color-name",main="title-of-histogram",xlab="x-axis-label",ylab="y-axis-label")
Breaks: Intervals (number of bars, groups) into which the data is divided to gauge frequency
Density: Shading of the bars
Angle: Degree of the angle of the lines used to shade the bars
Col: Color of the shading
Main: Title of the histogram
Xlab: X-axis label
Ylab: Y-axis label
Graph the independent variable against the dependent variable in a box plot.
>boxplot(variable1,variable2,ylab="y-axis-label",main="title-of-boxplot",col="color-name",names=c("name-of-variable1","name-of-variable2"))
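A minimal runnable sketch with made-up scores (any R colour name, e.g. "grey", works for col):
>scores_a<-c(12,15,14,10,13,11)   # hypothetical scores, group A
>scores_b<-c(18,17,20,16,19,21)   # hypothetical scores, group B
>hist(scores_a,breaks=5,density=8,angle=45,col="grey",main="Group A scores",xlab="Score",ylab="Frequency")
>boxplot(scores_a,scores_b,ylab="Score",main="Scores by group",col="grey",names=c("Group A","Group B"))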