Data Management and Exploratory Analysis in R
Davide Benedetti
Basic Functionalities
Let's import a dataset and perform some analysis on it. We can use the read.table family of functions to load data from a variety of standard formats, including txt and csv.
?read.table # to get more info on function read.table
mydata <- read.csv("data/eaef21.csv") # load the dataset eaef21.csv into R

class(mydata) # class() shows us the type of the object 'mydata'
## [1] "data.frame"

str(mydata) # str() gives us the structure of 'mydata', i.e. the elements
            # within it and how to access them (e.g., with $ or @)
## 'data.frame': 540 obs. of 51 variables:
##  $ ID      : int 5531 2658 5365 4468 3142 2170 2344 4583 2517 3563 ...
##  $ FEMALE  : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ MALE    : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ ETHBLACK: int 0 0 0 0 0 1 0 0 0 0 ...
##  $ AGE     : int 45 40 38 43 38 39 40 37 39 42 ...
##  $ S       : int 12 12 15 13 18 16 14 12 12 12 ...
##  [output truncated: 45 further int/num variables, among them the
##   education, marital-status, faith, ASVAB, weight, job and region ones]

head(mydata) # shows the first 6 rows of your dataset
## (first 6 rows of all 51 variables; omitted here for brevity)

tail(mydata) # shows the last 6 rows of your dataset
## (last 6 rows of all 51 variables; omitted here for brevity)

summary(mydata) # gives you a simple summary of your dataset (not very nice looking)
##        ID          FEMALE        MALE        ETHBLACK
##  Min.   :   18   Min.   :0.0   Min.   :0.0   Min.   :0.0000
##  1st Qu.: 1525   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0000
##  Median : 2788   Median :0.5   Median :0.5   Median :0.0000
##  Mean   : 3330   Mean   :0.5   Mean   :0.5   Mean   :0.1056
##  3rd Qu.: 4399   3rd Qu.:1.0   3rd Qu.:1.0   3rd Qu.:0.0000
##  Max.   :12110   Max.   :1.0   Max.   :1.0   Max.   :1.0000
##  [summaries of the remaining 47 variables omitted; note POV78 has 27 NA's]

nrow(mydata) # gives the number of rows
## [1] 540

ncol(mydata) # gives the number of columns
## [1] 51

The object mydata is a data.frame, i.e. a dataset with named columns. We can perform operations by rows and by columns with the function apply().

# this will apply the function class to each column of the dataset
apply(mydata,      # the dataset to work on
      MARGIN = 2,  # 1 to work by rows and 2 by columns
      FUN = class) # the function to apply
##        ID    FEMALE      MALE  ETHBLACK   ETHHISP  ETHWHITE       AGE
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## [all 51 columns are reported as "numeric", because apply() first coerces
##  the data frame to a numeric matrix]

R can also load from different data sources. For example, the package foreign can read Stata datasets.

# install package foreign if it is missing
if(!require(foreign)){install.packages("foreign")}
library(foreign)

# We can now read a Stata file using the function read.dta
?read.dta

# Let's load the dataset
mydata_new <- read.dta("data/eaef21.dta")

# head and summary give an overview of what's inside your dataset
head(mydata_new)
summary(mydata_new)
## (same output as for the csv version above)

# remove an R object from the environment
rm(mydata_new)

We can show some more summary statistics with the package stargazer.

# install package stargazer if it is missing
if(!require(stargazer)){install.packages("stargazer")}
library(stargazer)

stargazer(mydata, type = "text", iqr = TRUE, median = TRUE)
## ==========================================================================
## Statistic  N     Mean   St. Dev.   Min  Pctl(25) Median Pctl(75)   Max
## --------------------------------------------------------------------------
## ID        540 3,330.191 2,684.020   18   1,525  2,787.5 4,398.8  12,110
## FEMALE    540   0.500     0.500      0     0      0.5      1        1
## AGE       540  40.919     2.323     37    39      41      43       45
## S         540  13.672     2.438      7    12      13      16       20
## EARNINGS  540  19.636    14.416   2.130 10.758  16.000  23.155  120.190
## HOURS     540  40.531     9.186     10    40      40      45       60
## [remaining rows omitted for brevity]
## --------------------------------------------------------------------------

Tidyverse

Tidyverse is a nice R package combining several packages for data management and plotting, including dplyr, tidyr and ggplot2. This package allows us to easily manipulate data. We can summarize the data with summarize().

if(!require(tidyverse)){install.packages("tidyverse")}
## Loading required package: tidyverse
## -- Attaching packages --------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.3.2
## v tibble  2.1.1     v dplyr   0.8.1
## v tidyr   0.8.1     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.4.0
## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidyverse) # package for data manipulation

# the following code gives the same result as
# mean(mydata$EARNINGS)
mydata %>%
summarize(mean(EARNINGS))
## mean(EARNINGS)
## 1 19.63622
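summarize() can also name the resulting columns and compute several statistics at once. A small sketch (the column names here are our own choice, not from the original):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# named summary columns, several statistics in one call
mydata %>%
  summarize(mean_earnings = mean(EARNINGS),
            sd_earnings   = sd(EARNINGS),
            n             = n())
```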
Create new variables with mutate().
# create and add a new variable
mydata %>%
mutate(EDUCUNI = EDUCBA + EDUCMAST) %>%
head()
## (first 6 rows of all 51 original variables omitted for brevity)
##   REGNE REGNC REGW REGS EDUCUNI
## 1     0     0    0    1       0
## 2     0     0    0    1       0
## 3     0     1    0    0       0
## 4     0     0    1    0       0
## 5     1     0    0    0       1
## 6     0     0    0    1       0
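Note that mutate() returns a modified copy: mydata itself is unchanged unless we assign the result. A minimal sketch of this point (the object name mydata_uni is our own):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# the pipeline printed the new column but did not store it;
# assign the result to keep EDUCUNI (here into a new object)
mydata_uni <- mydata %>%
  mutate(EDUCUNI = EDUCBA + EDUCMAST)

"EDUCUNI" %in% names(mydata)     # FALSE: the original is untouched
"EDUCUNI" %in% names(mydata_uni) # TRUE
```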
Combine it with other functions, e.g. stargazer.
# stargazer with dplyr
mydata %>%
mutate(EDUCUNI = EDUCBA + EDUCMAST) %>%
stargazer(type = "text")
##
## ================================================
## Statistic            N    Mean    St. Dev.  Min    Max
## ------------------------------------------------
## ID                  540 3,330.191 2,684.020  18  12,110
## FEMALE              540   0.500     0.500     0     1
## MALE                540   0.500     0.500     0     1
## ETHBLACK            540   0.106     0.308     0     1
## ETHHISP             540   0.052     0.222     0     1
## ETHWHITE            540   0.843     0.365     0     1
## AGE                 540  40.919     2.323    37    45
## S                   540  13.672     2.438     7    20
## EDUCPROF            540   0.007     0.086     0     1
## EDUCPHD             540   0.002     0.043     0     1
## EDUCMAST            540   0.052     0.222     0     1
## EDUCBA              540   0.194     0.396     0     1
## EDUCAA              540   0.089     0.285     0     1
## EDUCHSD             540   0.548     0.498     0     1
## EDUCDO              540   0.078     0.268     0     1
## SINGLE              540   0.148     0.356     0     1
## MARRIED             540   0.657     0.475     0     1
## DIVORCED            540   0.194     0.396     0     1
## FAITHN              540   0.046     0.258    -3     1
## FAITHP              540   0.496     0.522    -3     1
## FAITHC              540   0.320     0.490    -3     1
## FAITHJ              540  -0.004     0.136    -3     1
## FAITHO              540   0.111     0.348    -3     1
## ASVAB01             540  49.996     9.540    22    68
## ASVAB02             540  50.476     9.808    31    66
## ASVAB03             540  49.709     9.376    24    61
## ASVAB04             540  50.356     9.677    20    62
## ASVAB05             540  50.780     9.470    20    62
## ASVAB06             540  50.222     9.530    22    72
## ASVABC              540  51.363     9.568  25.459 66.080
## HEIGHT              540  67.674     4.238    48    80
## WEIGHT85            540 156.748    34.880    95   322
## WEIGHT02            540 184.013    44.828    95   400
## SM                  540  11.580     2.816     0    20
## SF                  540  11.837     3.537     0    20
## SIBLINGS            540   3.274     2.188     0    13
## LIBRARY             540   0.780     0.415     0     1
## POV78               513   0.131     0.337     0     1
## EXP                 540  16.900     4.433   1.154 23.558
## EARNINGS            540  19.636    14.416   2.130 120.190
## HOURS               540  40.531     9.186    10    60
## TENURE              540   7.034     6.427   0.019 24.942
## COLLBARG            540   0.187     0.390     0     1
## CATGOV              540   0.228     0.420     0     1
## CATPRI              540   0.754     0.431     0     1
## CATSE               540   0.019     0.135     0     1
## URBAN               540   0.724     0.483     0     2
## REGNE               540   0.143     0.350     0     1
## REGNC               540   0.337     0.473     0     1
## REGW                540   0.157     0.365     0     1
## REGS                540   0.363     0.481     0     1
## EDUCUNI             540   0.246     0.431     0     1
## ------------------------------------------------
Perform operations by groups.
# mutate/summarize by groups
mydata %>%
group_by(FEMALE) %>%
summarize(mean(EXP))
## # A tibble: 2 x 2
##   FEMALE `mean(EXP)`
##    <int>       <dbl>
## 1      0        17.9
## 2      1        15.9
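group_by() works with any summary function; for instance, n() counts the observations in each group. A sketch (column names are our own):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# group sizes alongside a group mean
mydata %>%
  group_by(FEMALE) %>%
  summarize(n_obs    = n(),
            mean_exp = mean(EXP))
```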
mydata %>%
group_by(MARRIED) %>%
mutate(DHOURS = HOURS - mean(HOURS)) %>%
head()
## # A tibble: 6 x 52
## # Groups:   MARRIED [2]
##      ID FEMALE  MALE ETHBLACK ETHHISP ETHWHITE   AGE     S EDUCPROF EDUCPHD
##   <int>  <int> <int>    <int>   <int>    <int> <int> <int>    <int>   <int>
## 1  5531      0     1        0       0        1    45    12        0       0
## 2  2658      0     1        0       1        0    40    12        0       0
## 3  5365      0     1        0       0        1    38    15        0       0
## 4  4468      0     1        0       0        1    43    13        0       0
## 5  3142      0     1        0       0        1    38    18        0       0
## 6  2170      0     1        1       0        0    39    16        0       0
## # ... with 42 more variables, including EDUCMAST, EDUCHSD, DIVORCED,
## #   FAITHJ, ASVAB03, ASVABC, SF, EARNINGS, CATGOV, REGNC and the new DHOURS
Perform more operations in a single pipeline.
# create a categorical age variable, then summarise by its groups
mydata %>%
mutate(AGE_D = ifelse(AGE <= 40, "below 40", "above 40")) %>%
group_by(AGE_D) %>%
summarise(mean(EARNINGS), mean(HOURS))
## # A tibble: 2 x 3
##   AGE_D    `mean(EARNINGS)` `mean(HOURS)`
##   <chr>               <dbl>         <dbl>
## 1 above 40             19.9          40.3
## 2 below 40             19.3          40.8
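For more than two age bands, case_when() generalizes ifelse(). A sketch (the cut-offs below are arbitrary, chosen only for illustration):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# three age bands instead of two; conditions are checked in order
mydata %>%
  mutate(AGE_D = case_when(AGE <= 39 ~ "37-39",
                           AGE <= 42 ~ "40-42",
                           TRUE      ~ "43-45")) %>%
  group_by(AGE_D) %>%
  summarise(mean(EARNINGS), mean(HOURS))
```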
Pick only the columns and observations we are interested in.
# you can also filter rows and select columns
mydata %>%
filter(AGE > 40) %>%
select(EARNINGS, EXP, FEMALE) %>%
head()
##   EARNINGS      EXP FEMALE
## 1    53.41 22.38461      0
## 2    29.50 18.25000      0
## 3    11.78 22.03846      0
## 4    27.30 16.38461      0
## 5    55.05 17.30769      0
## 6    12.30 22.40385      0
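select() also understands helper functions, which is convenient with many similarly named columns. A sketch (the particular filter conditions are our own choice):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# keep EARNINGS plus all education dummies and all ASVAB scores
mydata %>%
  filter(AGE > 40, FEMALE == 1) %>%   # several conditions at once
  select(EARNINGS,
         starts_with("EDUC"),
         starts_with("ASVAB")) %>%
  head()
```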
Now, let's compare the tidyverse approach with the classic one. The tidyverse approach is easier to follow, particularly when the actions we want to perform are more complex.
# the two following blocks of code perform the same operation
# tidyverse approach
mydata %>%
filter(AGE > 40 & EDUCBA == 1) %>%
group_by(URBAN) %>%
summarise(mean(EARNINGS))
# classic R
summarise(
group_by(
filter(
mydata, AGE > 40 & EDUCBA == 1),
URBAN),
mean(EARNINGS)
)
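A third style makes the equivalence explicit by storing each step in an intermediate object (the step names below are our own):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# same computation as the pipeline, one temporary object per step
step1 <- filter(mydata, AGE > 40 & EDUCBA == 1)
step2 <- group_by(step1, URBAN)
step3 <- summarise(step2, mean(EARNINGS))
step3
```

The pipe simply removes the need to name (and later clean up) these intermediates.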
Now we can try to perform more complex operations. For example, the FAITH variables have some strange values (-3) when they should be binary. We remove these strange values from the FAITH variables. Moreover, we can convert all binary variables from numeric to qualitative (i.e. characters or factors). To achieve this we can use more advanced functions like mutate_if.
# We need to create a function which tells us whether a variable is binary
# the function below takes a vector x and returns TRUE if all elements in x
# are 0 or 1 (i.e. if x is binary), and FALSE otherwise.
is.binary <- function(x){all(x %in% 0:1)}

# modify the dataset and save it into a new object, mydata2
mydata2 <- mydata %>%
mutate(FAITHC = ifelse(FAITHC < 0, 0, FAITHC)) %>% # replace all values below 0 with 0
mutate(FAITHJ = ifelse(FAITHJ < 0, 0, FAITHJ)) %>%
mutate(FAITHN = ifelse(FAITHN < 0, 0, FAITHN)) %>%
mutate(FAITHO = ifelse(FAITHO < 0, 0, FAITHO)) %>%
mutate(FAITHP = ifelse(FAITHP < 0, 0, FAITHP)) %>%
mutate(URBAN = factor(URBAN)) %>%
mutate_if(is.binary, # mutate all variables which satisfy this condition
factor) # apply function factor() to all columns satisfying the above condition

mydata2 %>% stargazer(type = "text")
##
## ================================================
## StatisticNMeanSt. Dev. Min Max
##
## ID540 3,330.191 2,684.020 18 12,110
## AGE 54040.919 2.323 37 45
## S 54013.672 2.438 720
## ASVAB01 54049.996 9.540 22 68
## ASVAB02 54050.476 9.808 31 66
## ASVAB03 54049.709 9.376 24 61
## ASVAB04 54050.356 9.677 20 62
## ASVAB05 54050.780 9.470 20 62
## ASVAB06 54050.222 9.530 22 72
## ASVABC54051.363 9.568 25.459 66.080
## HEIGHT54067.674 4.238 48 80
## WEIGHT85540156.748 34.880 95 322
## WEIGHT02540184.013 44.828 95 400
## SM54011.580 2.816 020
## SF54011.837 3.537 020
## SIBLINGS540 3.274 2.188 013
## POV78 513 0.131 0.337 0 1
## EXP 54016.900 4.433 1.15423.558
## EARNINGS54019.63614.416 2.130120.190
## HOURS 54040.531 9.186 10 60
## TENURE540 7.034 6.427 0.01924.942
##
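The five repeated mutate() calls on the FAITH variables can be collapsed into one with mutate_at() (or across() in newer dplyr versions). A sketch of the same cleaning step:

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

is.binary <- function(x){all(x %in% 0:1)}

# one scoped call instead of five: clamp every FAITH* column at 0
mydata2 <- mydata %>%
  mutate_at(vars(starts_with("FAITH")),
            ~ ifelse(. < 0, 0, .)) %>%
  mutate(URBAN = factor(URBAN)) %>%
  mutate_if(is.binary, factor)
```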
Univariate analysis
We can also easily plot our data. We can use histograms for numerical variables.
mydata2 %>%
ggplot(aes(EARNINGS)) +
geom_histogram(bins = 100,
color = "grey30",
fill = "white") +
ylab("") +
xlab("") +
ggtitle("Histogram of EARNINGS")
We can add a density line, but we need to show histograms as probabilities.
mydata2 %>%
ggplot(aes(EARNINGS)) +
geom_histogram(aes(y = ..density..),
bins = 100,
color = "grey30",
fill = "white") +
geom_density(alpha = .2,
fill = "antiquewhite3") +
ylab("") +
xlab("") +
ggtitle("Histogram of Earnings")
We can also perform multi-plotting by some conditions, using package reshape2.
if(!require(reshape2)){install.packages("reshape2")}
library(reshape2) # package for data reshaping
# Prepare data for Histogram plots of numerical variables
mydata2 %>%
select_if(negate(is.factor)) %>% # Remove categorical var.
mutate(ID = factor(ID)) %>%      # ID to a categorical variable
melt(id.vars = c("ID")) %>%      # Reshape dataset for multi-plots
ggplot(aes(value)) +             # Histograms of numerical var.
geom_histogram() +
facet_wrap(~variable, scales = "free") +
ylab("") +
xlab("") +
ggtitle("Histograms of numerical variables")
For qualitative analysis you can try for yourself geom_bar() and geom_count().
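As a starting point, a bar chart of a single qualitative variable might look like this (a sketch; the choice of MARRIED is arbitrary):

```r
library(tidyverse)

mydata <- read.csv("data/eaef21.csv")

# counts of observations per marital status, treated as a factor
mydata %>%
  mutate(MARRIED = factor(MARRIED)) %>%
  ggplot(aes(MARRIED)) +
  geom_bar() +
  ggtitle("Counts by marital status")
```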
Bivariate analysis
Let's start by checking correlations. This can help identify possible multicollinearity issues.
if(!require(corrplot)){install.packages("corrplot")}
library(corrplot)# package for correlation plotting
# Prepare data for correlation plot of numerical variables
mydata2 %>%
na.omit() %>%
select_if(negate(is.factor)) %>% # Remove categorical var.
select(-ID) %>% # Remove id
cor() %>%
corrplot(order = "hclust", # Correlation plot
type = "upper",
tl.col = "black",
tl.srt = 45)
We can then perform boxplots/histograms/bars by groups.
mydata2 %>%
na.omit() %>%
select(ID, FEMALE, EARNINGS) %>%
mutate(ID = factor(ID)) %>% # id to a categorical var.
mutate(FEMALE = factor(FEMALE)) %>% # FEMALE to a categorical var.
melt(id.vars = c("ID", "FEMALE")) %>% # Reshape dataset
ggplot() +
geom_boxplot(aes(x = FEMALE,
y = value,
fill = FEMALE)) +
facet_wrap(~variable, scales = "free") +
xlab("") +
ylab("") +
ggtitle("Boxplots of Earnings by Gender")
mydata2 %>%
ggplot(aes(x = EXP,
y = log(EARNINGS),
color = URBAN)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("") +
ylab("") +
ggtitle("Earnings on Experience by Urban")
Univariate regression.
mod0 <- mydata2 %>%
lm(log(EARNINGS) ~ EXP, data = .)
mod0 %>% summary()
##
## Call:
## lm(formula = log(EARNINGS) ~ EXP, data = .)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.00515 -0.41689 -0.03241  0.35866  2.02727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.449410   0.098823  24.786  < 2e-16 ***
## EXP         0.020271   0.005656   3.584  0.00037 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5822 on 538 degrees of freedom
## Multiple R-squared:  0.02332, Adjusted R-squared:  0.0215
## F-statistic: 12.84 on 1 and 538 DF,  p-value: 0.0003695

Multivariate analysis

We can now move on to a multiple linear regression model.

mod1 <- mydata2 %>%
lm(log(EARNINGS) ~ AGE + EXP +
EDUCPROF + EDUCPHD +
EDUCMAST + EDUCBA +
EDUCAA + EDUCHSD +
EDUCDO, data = .)
mod1 %>% summary()
##
## Call:
## lm(formula = log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD +
## EDUCMAST + EDUCBA + EDUCAA + EDUCHSD + EDUCDO, data = .)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.77593 -0.32854 -0.03132  0.30212  2.01786
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.671919   0.411639   8.920  < 2e-16 ***
## AGE         -0.030217   0.009872  -3.061 0.002319 **
## EXP          0.036813   0.005303   6.942 1.13e-11 ***
## EDUCPROF1    1.322317   0.281213   4.702 3.29e-06 ***
## EDUCPHD1     0.135952   0.518411   0.262 0.793232
## EDUCMAST1    0.244102   0.158043   1.545 0.123057
## EDUCBA1      0.123178   0.134919   0.913 0.361672
## EDUCAA1     -0.247465   0.145376  -1.702 0.089296 .
## EDUCHSD1    -0.437111   0.129628  -3.372 0.000801 ***
## EDUCDO1     -0.651896   0.148039  -4.404 1.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5024 on 530 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2714
## F-statistic: 23.31 on 9 and 530 DF,  p-value: < 2.2e-16

mod1 %>% stargazer(type = "text")
##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                            log(EARNINGS)
## -----------------------------------------------
## AGE                          -0.030***
##                               (0.010)
##
## EXP                           0.037***
##                               (0.005)
##
## EDUCPROF1                     1.322***
##                               (0.281)
##
## EDUCPHD1                       0.136
##                               (0.518)
##
## EDUCMAST1                      0.244
##                               (0.158)
##
## EDUCBA1                        0.123
##                               (0.135)
##
## EDUCAA1                       -0.247*
##                               (0.145)
##
## EDUCHSD1                     -0.437***
##                               (0.130)
##
## EDUCDO1                      -0.652***
##                               (0.148)
##
## Constant                      3.672***
##                               (0.412)
##
## -----------------------------------------------
## Observations                    540
## R2                             0.284
## Adjusted R2                    0.271
## Residual Std. Error       0.502 (df = 530)
## F Statistic           23.306*** (df = 9; 530)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
We can compare two models side by side.
stargazer(mod0, mod1, type = "text")
##
## ===================================================================
##                                     Dependent variable:
##                     -----------------------------------------------
##                                        log(EARNINGS)
##                               (1)                     (2)
## -------------------------------------------------------------------
## AGE                                                -0.030***
##                                                     (0.010)
##
## EXP                         0.020***                0.037***
##                             (0.006)                 (0.005)
##
## EDUCPROF1                                           1.322***
##                                                     (0.281)
##
## EDUCPHD1                                             0.136
##                                                     (0.518)
##
## EDUCMAST1                                            0.244
##                                                     (0.158)
##
## EDUCBA1                                              0.123
##                                                     (0.135)
##
## EDUCAA1                                             -0.247*
##                                                     (0.145)
##
## EDUCHSD1                                           -0.437***
##                                                     (0.130)
##
## EDUCDO1                                            -0.652***
##                                                     (0.148)
##
## Constant                    2.449***                3.672***
##                             (0.099)                 (0.412)
##
## -------------------------------------------------------------------
## Observations                  540                     540
## R2                           0.023                   0.284
## Adjusted R2                  0.021                   0.271
## Residual Std. Error     0.582 (df = 538)        0.502 (df = 530)
## F Statistic         12.843*** (df = 1; 538) 23.306*** (df = 9; 530)
## ===================================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01
Test Hypothesis
After we have made sure the model is well specified, we can use it to test hypotheses.
library(lmtest)
library(sandwich)
library(car)
# we can test that the coefficient of AGE equals 0
linearHypothesis(mod1, "AGE = 0")
## Linear hypothesis test
##
## Hypothesis:
## AGE = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
##     EDUCAA + EDUCHSD + EDUCDO
##
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1    531 136.13
## 2    530 133.77  1    2.3646 9.3688 0.002319 **
##
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# test that the coefficient of EXP equals 0.5
linearHypothesis(mod1, "EXP = 0.5")
## Linear hypothesis test
##
## Hypothesis:
## EXP = 0.5
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1    531 2059.49
## 2    530  133.77  1    1925.7 7629.9 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# test that the coefficient of AGE equals minus the coefficient of EXP
linearHypothesis(mod1, "AGE = -EXP")
## Linear hypothesis test
##
## Hypothesis:
## AGE  + EXP = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
##     EDUCAA + EDUCHSD + EDUCDO
##
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    531 133.88
## 2    530 133.77  1   0.11699 0.4635 0.4963
# the same test as before, written differently
linearHypothesis(mod1, "AGE + EXP = 0")
## Linear hypothesis test
##
## Hypothesis:
## AGE  + EXP = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    531 133.88
## 2    530 133.77  1   0.11699 0.4635 0.4963
# test several hypotheses simultaneously
linearHypothesis(mod1, c("EDUCPHD1 = 0",
                         "EDUCMAST1 = 0",
                         "EDUCBA1 = 0",
                         "EDUCAA1 = 0"))
## Linear hypothesis test
##
## Hypothesis:
## EDUCPHD1 = 0
## EDUCMAST1 = 0
## EDUCBA1 = 0
## EDUCAA1 = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1    534 139.62
## 2    530 133.77  4     5.848 5.7925 0.000144 ***
##
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
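Under the hood, linearHypothesis() compares the residual sum of squares of the restricted and unrestricted models with an F statistic, exactly as in the tables above. The sketch below reproduces that mechanic on simulated data (the names x1, x2, y are illustrative, not from eaef21):

```r
# Manual restricted-vs-unrestricted F test, the logic behind linearHypothesis()
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)          # x2 is truly irrelevant here

unrestricted <- lm(y ~ x1 + x2)
restricted   <- lm(y ~ x1)             # imposes the hypothesis x2 = 0

rss_u <- sum(resid(unrestricted)^2)
rss_r <- sum(resid(restricted)^2)
q     <- 1                             # number of restrictions
df_u  <- df.residual(unrestricted)

# F = ((RSS_r - RSS_u) / q) / (RSS_u / df_u)
f_stat <- ((rss_r - rss_u) / q) / (rss_u / df_u)
p_val  <- pf(f_stat, q, df_u, lower.tail = FALSE)

# anova() performs the same comparison and should match f_stat
anova(restricted, unrestricted)
```

The same comparison can be run with anova(restricted, unrestricted), which reports an identical F statistic and p-value.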
Predictions
We can also use the model to forecast unseen values with predict().
# let's estimate a model of earnings on age and experience only,
# and then forecast the salary of two individuals
# one has age = 40 and exp = 18
# the other has age = 42 and exp = 20
mydata2 %>%
lm(log(EARNINGS) ~ AGE + EXP, data = .) %>%
predict(newdata = data.frame(
AGE = c(40, 42),
EXP = c(18, 20)
))
##        1        2
## 2.834429 2.844080
The forecasted salaries above are in logs, so they need to be transformed back to levels.
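A sketch of the back-transformation, on simulated data (the names exp_yrs and earnings are illustrative, not from eaef21). Note that the naive exp() of a log prediction underestimates the conditional mean; a common fix, assuming roughly normal errors, adds half the residual variance before exponentiating:

```r
# Back-transforming predictions from a log-linear model
set.seed(2)
exp_yrs  <- runif(100, 0, 30)
earnings <- exp(2.5 + 0.02 * exp_yrs + rnorm(100, sd = 0.5))

fit <- lm(log(earnings) ~ exp_yrs)
log_pred <- predict(fit, newdata = data.frame(exp_yrs = c(18, 20)))

# naive back-transformation (biased downward for the conditional mean)
exp(log_pred)

# correction using the residual variance: exp(pred + sigma^2 / 2)
exp(log_pred + summary(fit)$sigma^2 / 2)
```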
Diagnostics (Advanced)
There are several things that need to be checked in a linear model:
- Heteroskedasticity
- (Quasi-)Multicollinearity
- Non-Normality
- Misspecification
- Autocorrelation and Non-stationarity (for time series)
Heteroskedasticity occurs when the variance of the residuals is correlated with the covariates. Heteroskedasticity does not bias the estimated coefficients, but it biases their standard errors, meaning that your tests might erroneously conclude that some coefficients are significant when, in reality, they are not, and vice versa. In the presence of heteroskedasticity we must use robust standard errors to test our hypotheses, or we can use a log transformation of our variables. We can check graphically whether the squared residuals are correlated with the model covariates.
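A minimal sketch of that graphical check, on simulated heteroskedastic data (names are illustrative): a visible trend in the squared residuals against a covariate suggests heteroskedasticity.

```r
# Plot squared residuals against a covariate to eyeball heteroskedasticity
set.seed(6)
x <- runif(200, 0, 10)
y <- 2 + x + rnorm(200, sd = 0.3 * x)  # error variance grows with x

fit <- lm(y ~ x)
plot(x, resid(fit)^2,
     xlab = "x", ylab = "squared residuals",
     main = "Squared residuals vs covariate")
```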
We can also run the Breusch-Pagan test, whose null hypothesis is homoskedasticity. The Breusch-Pagan test runs a regression of the squared residuals on the covariates and then tests N*R2 ~ Chi2(k), where k is the number of parameters.
mod1 %>% bptest()
##
##studentized Breusch-Pagan test
##
## data:  .
## BP = 8.9716, df = 9, p-value = 0.4399
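The studentized BP statistic that bptest() reports can be reproduced by hand as n times the R2 of the auxiliary regression of squared residuals on the covariates. A sketch on simulated heteroskedastic data (names are illustrative):

```r
# Manual (studentized) Breusch-Pagan test: n * R^2 of the auxiliary regression
set.seed(3)
n <- 300
x <- rnorm(n)
y <- 1 + x + rnorm(n, sd = 1 + 0.5 * abs(x))  # heteroskedastic errors

fit <- lm(y ~ x)
u2  <- resid(fit)^2
aux <- lm(u2 ~ x)                  # auxiliary regression on the covariates

bp_stat <- n * summary(aux)$r.squared
p_val   <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat; p_val
# lmtest::bptest(fit) should report the same statistic
```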
If we have heteroskedasticity we need robust standard errors.
standard_se <- function(x) {sqrt(diag(vcov(x)))}
robust_se <- function(x) {sqrt(diag(vcovHAC(x)))}
# first column uses classic s.e. while the second one uses robust s.e.
stargazer(mod1, mod1, type = "text",
          se = list(standard_se(mod1), robust_se(mod1)))
##
## ===========================================================
##                               Dependent variable:
##                     ---------------------------------------
##                                  log(EARNINGS)
##                           (1)                  (2)
## -----------------------------------------------------------
## AGE                    -0.030***            -0.030***
##                         (0.010)              (0.009)
##
## EXP                     0.037***             0.037***
##                         (0.005)              (0.006)
##
## EDUCPROF1               1.322***             1.322***
##                         (0.281)              (0.227)
##
## EDUCPHD1                 0.136                0.136
##                         (0.518)              (0.143)
##
## EDUCMAST1                0.244                0.244
##                         (0.158)              (0.168)
##
## EDUCBA1                  0.123                0.123
##                         (0.135)              (0.151)
##
## EDUCAA1                 -0.247*              -0.247
##                         (0.145)              (0.159)
##
## EDUCHSD1               -0.437***            -0.437***
##                         (0.130)              (0.140)
##
## EDUCDO1                -0.652***            -0.652***
##                         (0.148)              (0.157)
##
## Constant                3.672***             3.672***
##                         (0.412)              (0.389)
##
## -----------------------------------------------------------
## Observations              540                  540
## R2                       0.284                0.284
## Adjusted R2              0.271                0.271
## Residual Std. Error (df = 530)   0.502        0.502
## F Statistic (df = 9; 530)     23.306***     23.306***
## ===========================================================
## Note:                       *p<0.1; **p<0.05; ***p<0.01
We can also test linear hypotheses with robust standard errors.
# test with robust standard errors
linearHypothesis(mod1, "AGE = 0", vcov. = vcovHAC)
## Linear hypothesis test
##
## Hypothesis:
## AGE = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
##     EDUCAA + EDUCHSD + EDUCDO
##
## Note: Coefficient covariance matrix supplied.
##
##   Res.Df Df      F   Pr(>F)
## 1    531
## 2    530  1 10.578 0.001217 **
##
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can test multicollinearity with the variance inflation factor (VIF). The VIF is a transformation of the R2 obtained from regressing variable x_i on all other covariates, so we have one VIF for every covariate. Values above 10 can indicate that a variable is collinear with one or more of the others; some researchers suggest using lower thresholds such as 2.50. Check ?vif for more info.
mod1 %>% vif()
##      AGE      EXP EDUCPROF  EDUCPHD EDUCMAST   EDUCBA   EDUCAA  EDUCHSD
## 1.123057 1.180256 1.244010 1.062837 2.627282 6.100367 3.662003 8.904531
##   EDUCDO
## 3.363237
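The VIF formula, vif_i = 1 / (1 - R2_i), can be computed by hand from the auxiliary regression of one covariate on the others. A sketch with simulated collinear data (names are illustrative):

```r
# Manual VIF: regress one covariate on the others and invert 1 - R^2
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)    # strongly correlated with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1   # well above 1, reflecting the x1-x2 collinearity
```

car::vif() applies this computation to every covariate of the fitted model at once.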
Non-normality occurs when the residuals are not normally distributed, often due to the presence of outliers, which we can account for with dummy variables. We can use a Q-Q plot to check whether the residuals behave normally. The plot compares the quantiles of the model residuals with those of a normal distribution: if the residuals are normally distributed, they should lie on the line y = x.
mod1 %>% plot(2)
Misspecification occurs when the model is wrongly specified. It can be due to several reasons, e.g. the wrong functional form (i.e. using a linear model when the phenomenon is non-linear) or endogeneity (i.e. omission of important covariates, or simultaneous effects between the dependent variable and one or more covariates). If the model is misspecified, all estimated coefficients will be biased and we need to modify our model. We can test for misspecification (regarding the linearity of the model) with Ramsey's RESET test. The null hypothesis of this test is that the model is well specified.
mod1 %>% reset(power = 1:3)
##
##RESET test
##
## data:  .
## RESET = 0.43615, df1 = 3, df2 = 527, p-value = 0.7272
The RESET test is performed in the following way: we regress the dependent variable on all the covariates and on polynomials of the fitted values of our model, in this case squares and cubes. We then test whether the coefficients on the polynomials of the fitted values are jointly zero.
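Those steps can be sketched by hand on simulated data (names are illustrative): fit the base model, add squares and cubes of its fitted values, and F-test the added terms.

```r
# Manual RESET mechanics: augment with powers of the fitted values
set.seed(5)
n <- 200
x <- runif(n, 0, 5)
y <- 1 + x + 0.4 * x^2 + rnorm(n)      # true relationship is quadratic

base <- lm(y ~ x)                      # misspecified linear fit
fv   <- fitted(base)
aug  <- lm(y ~ x + I(fv^2) + I(fv^3))  # add squares and cubes of the fits

# joint F test on the added terms; a small p-value flags misspecification
anova(base, aug)
```

lmtest::reset() wraps exactly this comparison, with the `power` argument controlling which polynomials of the fitted values are added.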