Panel Data in R
Panel Data in R
Davide Benedetti
Working with Panel datasets
Lets import a dataset and perform some analysis on it.
# After changing the dir lets load our dataset as usual
# setwd(/your/folder)
library(foreign)
prod <- read.dta(“prod_balanced.dta”)library(stargazer)stargazer(prod, type = “text”)## ## ==============================================================## Statistic N MeanSt. Dev. MinMax ## ————————————————————–## year14,504 82.500 2.29179 86## sic4dig 14,5043,344.621249.9303,1113,909## id14,504 54,266.71028,160.68072 99,513## go14,504 102,449.400 302,269.600 665.176 5,539,650.000## m 14,504 48,869.760147,895.700 155.503 3,134,312.000## l 14,504 62.029107.40951,696## k 14,504 43,538.290206,041.1008.1306,947,903.000## sic3dig 14,504 334.093 25.020311390 ## sic2dig 14,504 33.223 2.45431 39## minyear 14,504 79.000 0.00079 79## maxyear 14,504 86.000 0.00086 86## countyear 14,5048.000 0.000 88## va14,504 53,579.600177,919.800 143.804 4,870,663.000## lnvaOl14,5045.941 0.891 2.5019.094## lnk 14,5048.854 1.655 2.096 15.754## ————————————————————–head(prod)## year sic4dig idgomlk sic3dig sic2dig minyear## 1 863813 729824.879 5117.624 23 1575.752 3813879## 2 793813 72 11606.232 5185.277 15 2402.043 3813879## 3 853813 727129.540 4211.143 19 1727.427 3813879## 4 803813 729808.034 6124.000 15 2602.932 3813879## 5 823813 72 11243.599 3619.531 16 2335.772 3813879## 6 833813 725957.458 2257.693 13 2102.915 3813879## maxyear countyear va lnvaOllnk## 186 8 4707.255 5.321366 7.362488## 286 8 6420.956 6.059272 7.784075## 386 8 2918.397 5.034351 7.454388## 486 8 3684.034 5.503714 7.864394## 586 8 7624.067 6.166477 7.756098## 686 8 3699.764 5.651075 7.651080The object prod si our usual data.frame, i.e.a dataset with named columns. The difference from a standard cross-sectional data however, is that it contains two index columns, one for the id of the individual measured (e.g., people, stocks, firms, countries, etc.), and the other for the time the individual was measured (e.g., days, months, quarters, years, etc.).A panel dataset is simply a cross-sectional data measured at different points in time. The additional time dimension will enable as to perform additional analyis as well as visualize the data either by time or by cross-section.The number of units in a panel dataset could be the same every period. This case is known as balanced panel, and it happens when the same individuals are observed in every period with no changes. However, more often the number of individuals changes from one period to another. Think about a panel of companies where some companies might default and exit the sample after some years while new ones, that did not exist before, are created.Our datset contains variables of several companies measured over a few years. We can use tidyverse to count how many companies we have every year, and see what years we are considering. And we have a constant of 1813 companies from 1979 to 1986 (balanced panel).library(tidyverse)prod %>%
group_by(year) %>% # this gives me the number of companies every year
summarise(n = n()) 
## # A tibble: 8 x 2
##year n
##
## 1791813
## 2801813
## 3811813
## 4821813
## 5831813
## 6841813
## 7851813
## 8861813 
Plotting variables of Panel datasets
As previously said, we can exploit the two dimensions of the data to say something about both about the cross-sectional distribution of the variables, and well as their dynamic evolution over time.
In the following code, we are averaging variable K (capital) by id (company number) thus removing the time dimension, and we are then plotting its distribution with an histogram.
You can try to visualize the ranking of which company had highest average which geom_bar.
library(tidyverse)
prod %>%
group_by(id) %>%
summarise(k_ave = mean(lnk)) %>%
ggplot(aes(k_ave)) +
geom_histogram(aes(y = ..density..)) +
ggtitle(Histogram of cross-sectional average of log of Capital) +
xlab() +
theme_minimal() 
Below we are instead averaging the same variable by time, therefore killing the cross-section and transforming the panel into a time-series of the average of log of capital over time.
prod %>%
group_by(year) %>%
summarise(k_ave = mean(lnk), k_se = sd(lnk)/sqrt(n())) %>%
ggplot(aes(x = year, y = k_ave)) +
geom_line() +
geom_point(size = 0.5) +
geom_ribbon(aes(ymin = k_ave  1.96*k_se,
ymax = k_ave + 1.96*k_se),
alpha = 0.2) +
ylab() +
ggtitle(Time-series of average of log of Capital) +
theme_minimal() 
We can also perform scatter plots of two variables colouring by time or id. In this dataset we have many ids, so we are plotting only a few of them by selecting 8 random ids among all unique ids.
set.seed(800)
prod %>%
filter(id %in% sample(unique(id), size = 8)) %>%
mutate(`Company ID` = factor(id)) %>%
ggplot(aes(x = lnk, y = lnvaOl, color = `Company ID`)) +
geom_point(shape = 3) +
ylab() +
xlab() +
ggtitle(Log of Production over Labour by log of Capital) +
theme_minimal() 
Working with dummies
Most of the variables in the dataset are sector codes, identifying the sector of the company. These variables are recorded as numbers, but they are qualitative variables. However, R will recognize and analyse them as quantitative variables unless we tell him not to do it.
In the following model, we am regressing log of production over labour (lnvaOl) on the 2 -digits sector code (sec2dig). However, R is estimating the model in the wrong way, considering this covariate as quantitative.
prod %>%
lm(lnvaOl ~ sic2dig, data = .) %>%
stargazer(type = text) 
##
## ===============================================
## Dependent variable:
## 
## lnvaOl
## 
## sic2dig0.057***
## (0.003)
##
## Constant 4.060***
## (0.099)
##
## 
## Observations14,504
## R2 0.024
## Adjusted R20.024
## Residual Std. Error 0.880 (df = 14502)
## F Statistic 361.865*** (df = 1; 14502)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01We need to convert this variable as factor, if we want that R interprets it as a qualitative variable. This way R will estimate a beta for every sector code.Note that one sector code is missing (sector #31). The value of this sector is absorbed by the constant. This means that sector 31 has an average log productio over labour of 5.787, and all the other values have to be interpreted as difference from it. E.g., companies in industrial sector 32 have on average 7.1% more production over labour than compnaies in sector 31.prod %>%
mutate(sic2dig = factor(sic2dig)) %>%
lm(lnvaOl ~ sic2dig, data = .) %>%
stargazer(type = text) 
##
## ===============================================
## Dependent variable:
## 
## lnvaOl
## 
## sic2dig320.071***
## (0.020)
##
## sic2dig33-0.136***
## (0.027)
##
## sic2dig340.319***
## (0.030)
##
## sic2dig350.714***
## (0.024)
##
## sic2dig360.387***
## (0.042)
##
## sic2dig370.621***
## (0.077)
##
## sic2dig380.212***
## (0.023)
##
## sic2dig39 -0.337
## (0.215)
##
## Constant 5.787***
## (0.012)
##
## 
## Observations14,504
## R2 0.075
## Adjusted R20.074
## Residual Std. Error 0.857 (df = 14495)
## F Statistic 146.253*** (df = 8; 14495)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01In the next two models, we are regressing the same variable (lnvaOl) on year (column 1), and on year transformed as factor (column 2). Can you explain the difference between the two models?mod1 <- lm(lnvaOl ~ year, data = prod)mod2 <- lm(lnvaOl ~ factor(year), data = prod)stargazer(mod1, mod2, type = “text”)## ## =======================================================================## Dependent variable:## —————————————————## lnvaOl ##(1) (2) ## ———————————————————————–## year-0.012***##(0.003) #### factor(year)80 0.072** ##(0.030) #### factor(year)810.151*** ##(0.030) #### factor(year)82 0.064** ##(0.030) #### factor(year)83 -0.027##(0.030) #### factor(year)840.001##(0.030) #### factor(year)85-0.075** ##(0.030) #### factor(year)860.036##(0.030) #### Constant6.952***5.913*** ##(0.266) (0.021) #### ———————————————————————–## Observations 14,50414,504## R20.001 0.005## Adjusted R2 0.001 0.005## Residual Std. Error0.890 (df = 14502)0.889 (df = 14496)## F Statistic 14.443*** (df = 1; 14502) 11.066*** (df = 7; 14496)## =======================================================================## Note: *p<0.1; **p<0.05; ***p<0.01Panel regressionWe can run panel regression in R with function plm from the homonymous package. To use plm, the dataset needs to be converted in panel format using function pdata.frame, in which we specify the names of the columns in the data containing the id and time index respectively.library(plm)prod_plm <- pdata.frame(prod, index = c(“id”, “year”))Panel models with fixed effects are the most common types of panel regressions used in economics. They are simply linear models with a different intercept for either every individual or period in the dataset, or both. By default, function plm runs panel data models with individual fixed effects.Below we are comparing the results from lm (column 1) and plm with individual fixed effects (column 2). While lm estimates one unique constant, plm estimates one for every company (they are not printed out but can be shown with function fixef).mod1 <- lm(lnvaOl ~ lnk, data = prod)mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,effect = “individual”)stargazer(mod1, mod2, type = “text”,omit.stat = c(“rsq”,”adj.rsq”))## ## ==========================================================================##Dependent variable:## ——————————————————## lnvaOl## OLSpanel## linear## (1) (2) ## ————————————————————————–## lnk 0.283*** 0.062*** ## (0.004) (0.014) ## ## Constant3.434***## (0.034) ## ## ————————————————————————–## Observations 14,504 14,504## Residual Std. Error0.757 (df = 14502) ## F Statistic 5,556.180*** (df = 1; 14502) 19.109*** (df = 1; 12690)## ==========================================================================## Note:*p<0.1; **p<0.05; ***p<0.01You can use argument effect in function plm to specify whether you want to use individual or time fixed effects, or both. See example below.mod1 <- plm(lnvaOl ~ lnk, data = prod_plm,effect = “individual”)mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,effect = “time”)mod3 <- plm(lnvaOl ~ lnk, data = prod_plm,effect = “twoways”)stargazer(mod1, mod2, mod3, type = “text”,omit.stat = c(“rsq”,”adj.rsq”))## ## ============================================================================================##Dependent variable:##——————————————————————————-##lnvaOl ## (1)(2)(3) ## ——————————————————————————————–## lnk0.062***0.283*** 0.038***## (0.014)(0.004)(0.014) ## ## ——————————————————————————————–## Observations14,50414,504 14,504 ## F Statistic19.109*** (df = 1; 12690) 5,554.331*** (df = 1; 14495) 7.061*** (df = 1; 12683)## ============================================================================================## Note:*p<0.1; **p<0.05; ***p<0.01Technically, estimating an individual fixed effect panel model is the same as estimating a linear model including dummy variables for each individual. At the same time, estimating a time fixed effect panel model is the same as estimating a linear model including dummy variables for each time period, while a panel with both individual and time effects can be obtained by estimating a linear model with dummy variables for both individuals and times.The difference is that plm performs the estimating much faster than lm and does not print automatically the different intercepts, but only the betas.Below, you can see the difference between lm with time dummies and plm with time effects. The beta for lnk is identical, but lm prints all intercepts for every year while plm does not.mod1 <- lm(lnvaOl ~ 0 + lnk + factor(year), data = prod)mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,effect = “time”)stargazer(mod1, mod2, type = “text”,omit.stat = c(“rsq”,”adj.rsq”))## ## ===============================================================================## Dependent variable:## ———————————————————–## lnvaOl ##OLSpanel## linear ##(1) (2) ## ——————————————————————————-## lnk0.283***0.283***##(0.004) (0.004) #### factor(year)79 3.412***##(0.038) #### factor(year)80 3.469***##(0.038) #### factor(year)81 3.542***##(0.038) #### factor(year)82 3.465***##(0.038) #### factor(year)83 3.384***##(0.038) #### factor(year)84 3.422***##(0.038) #### factor(year)85 3.352***##(0.038) #### factor(year)86 3.467***##(0.038) #### ——————————————————————————-## Observations14,50414,504 ## Residual Std. Error 0.756 (df = 14495) ## F Statistic 100,275.000*** (df = 9; 14495) 5,554.331*** (df = 1; 14495)## ===============================================================================## Note: *p<0.1; **p<0.05; ***p<0.01As a final note, you can estimate a linear model with function plm by setting model to pooling.mod1 <- lm(lnvaOl ~ lnk, data = prod)mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,model = “pooling”)stargazer(mod1, mod2, type = “text”,omit.stat = c(“rsq”,”adj.rsq”)) 

![[SOLVED] CS Panel Data in R](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip.jpg)

![[Solved] Modify your first program to print a table of the words](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip-1200x1200.jpg)
 
 
 
Reviews
There are no reviews yet.