Assignment 1
Empirical Finance: Methods and Applications January 25, 2021
Datasets for problems 4, 5, and 6 are available on insendi.
You should submit a single pdf solution containing answers to all sub-parts of all problems (including
4-7). Typewritten solutions are preferred but handwritten and scanned solutions are acceptable.
Marks for each problem are listed below.
In addition, please submit code for problems 4-7 in the form of an R project. This should be a zipped folder that contains an R Project, a single R file with answers to all relevant parts of all problems, and all csv files (including those for 4-6 and any you download for problem 7). I should be able to download and run your R file directly. Please comment your code to make it as easy to interpret as possible.
Your marks depend on clarity of exposition in solutions and code. This includes figures and regression results.
You may discuss all problems with classmates but each student must independently write and submit their own solution. Solutions and code that have been clearly copied will cause the full assignment to receive 0 marks and may invite further disciplinary action.
Problem 1 (5 marks)
Suppose we see 5 observations of $(y_i, D_i)$, shown in the table below:

y_i   D_i
1     0
8     1
4     1
0     0
3     1

Consider the following linear model:
$$y_i = \beta_0 + \beta_1 D_i + v_i.$$
Suppose we estimate this model on the data above via OLS. Please explicitly find $\hat{\beta}_0^{OLS}$ and $\hat{\beta}_1^{OLS}$.
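A hand calculation for this problem can be checked numerically. The following is a minimal sketch, assuming base R only, that enters the five observations from the table and fits the same model with lm(). The object names y, D, fit and group_means are illustrative choices, not part of the assignment.

```r
# Five observations from the table in Problem 1
y <- c(1, 8, 4, 0, 3)
D <- c(0, 1, 1, 0, 1)

# OLS of y on a constant and the dummy D
fit <- lm(y ~ D)

# Intercept and slope, for comparison with the hand-derived estimates
print(coef(fit))

# With a single dummy regressor, OLS reproduces group means:
# the intercept is the mean of y when D = 0, and the slope is the
# difference in means between the D = 1 and D = 0 groups.
group_means <- tapply(y, D, mean)
print(group_means)
```

Comparing the two printed results is a quick way to confirm that the regression coefficients and the group means tell the same story.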
Problem 2 (10 marks)
Relative to the United Kingdom, the United States has borrower-friendly laws surrounding residential mortgage default. Many US states are Non-Recourse: that is, if borrowers stop making their mortgage payments, lenders cannot hold them responsible beyond seizing the home itself. On the other hand, the United Kingdom has Full-Recourse: lenders may seize cars, investments, garnish wages, et cetera. Many believe that the relative leniency of laws in the United States is responsible for higher rates of mortgage default.
For the sake of simplicity, assume laws may take only two forms: Non-Recourse (in the United States) or Full-Recourse (in the United Kingdom). Imagine we are interested in the causal (treatment) effect of Non-Recourse laws on mortgage default.
(a) Denote mortgage default for a borrower i by Di. In potential outcomes notation, write the average treatment effect of Non-Recourse laws on default. (3 marks)
(b) Suppose we compare the average default rates in the United States to the average default rates in the United Kingdom. Write this comparison in potential outcomes notation. (3 marks)
(c) Why does the expression in part (a) differ from that in part (b)? Please provide an explanation that is not simply mathematical, but that provides some intuition. Would you expect the answer in (b) to be higher or lower than that in (a)? Why? (4 marks)
Problem 3 (10 marks)
Suppose the relationship between $y_i$ and $x_i$ is as follows:
$$y_i = \beta_0 + \beta_1 x_i + v_i,$$
where $x_i$ is observable, $E[v_i \mid x_i] = 0$ and $E[x_i] = 0$. However, suppose we do not see $y_i$, but instead observe $\tilde{y}_i = y_i + \varepsilon_i$. Consider the regression:
$$\tilde{y}_i = \beta_0 + \beta_1 x_i + u_i.$$
You may assume that $\varepsilon_i$ has mean 0 and variance $\sigma^2$.
(a) Suppose that $Cov(x_i, \varepsilon_i) = 0$. Will the OLS estimator $\hat{\beta}_1^{ols}$ using $\tilde{y}_i$ instead of $y_i$ be biased for $\beta_1$? Show why or why not. (5 marks)
(b) Suppose instead that $\varepsilon_i = \gamma x_i + \eta_i$, where $\gamma \neq 0$ and $x_i$ and $\eta_i$ are independent. Will the OLS estimator $\hat{\beta}_1^{ols}$ using $\tilde{y}_i$ instead of $y_i$ be biased for $\beta_1$? Show why or why not. (5 marks)
Problem 4 (20 marks)
The dataset rollingsales_manhattan.xls contains details on 2020 real estate transactions in Manhattan.[1]
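Part (a) below asks for column names without spaces and for zero-price sales to be dropped. The following is a minimal sketch of those two steps on a made-up three-row data frame, so that it runs without the spreadsheet. The column names and values in toy are hypothetical stand-ins and are not taken from the actual file, and replacing spaces with an empty string is one possible choice of replacement.

```r
# Hypothetical toy data; column names and values are made up for illustration
toy <- data.frame(
  "SALE PRICE"   = c(0, 500000, 1200000),
  "YEAR BUILT"   = c(1920, 1985, 2005),
  "NEIGHBORHOOD" = c("CHELSEA", "CHELSEA", "SOHO"),
  check.names = FALSE
)

# Remove spaces from column names: "SALE PRICE" becomes "SALEPRICE"
names(toy) <- gsub(" ", "", names(toy))

# Drop observations with a sale price of 0
toy <- toy[toy$SALEPRICE != 0, ]

# Average sale price by neighborhood, sorted from highest to lowest
avg_price <- sort(tapply(toy$SALEPRICE, toy$NEIGHBORHOOD, mean),
                  decreasing = TRUE)
print(avg_price)
```

The same pattern should carry over to the real data once it is loaded, with the column names adjusted to whatever the spreadsheet actually contains.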
(a) Load the data into R and perform the following basic data cleaning exercises:[2]
- Relabel the column names to remove any spaces. One trick is names(dataset) <- gsub(" ", "", names(dataset)).
- Remove any observations with the sale price equal to 0.
Using this cleaned data, what neighborhood has the highest average sale price? (4 marks)
(b) Create a new variable equal to log(sale price). Create another variable representing the age of the property in 2020 (i.e. years since the year it was built). Run an OLS regression of log(sale price) on age and a set of dummy variables for each neighborhood (omitting one). Report the coefficient on age. What does this indicate about the relationship between age and sale price in the sample as a whole? (4 marks)
(c) Run an OLS regression of log(sale price) on age, but use only data from the Upper East Side below 79th street.[3] Report the coefficient on age. What does this indicate about the relationship between age and sale price in this particular neighborhood? (4 marks)
(d) Plot the mean and median sale price and the total quantity of sales across months in 2020. This can be on multiple figures or a single figure, and you may choose the plotting style that you feel best presents the data. Please comment on and discuss any major patterns you see in these plots. (4 marks)
(e) Create a chart showing a new (and hopefully interesting) pattern of your choice using this data. This may be a plot of any type, and may relate to the sale price or not. Please briefly describe the plot you have created. (4 marks)

[1] This data, and data for other New York City boroughs, can be found at https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page.
[2] One option is to convert this to a .csv file and use the read_csv command we have worked with in class. However, I recommend using library(readxl) and the command read_xls(). Note that the argument skip (e.g. read_xls(dataset, skip=…)) can be useful for getting rid of the useless rows at the top of an Excel spreadsheet.
[3] NEIGHBORHOOD=="UPPER EAST SIDE (59-79)".

Problem 5 (15 marks)
On February 1st, 2015 the New England Patriots won Super Bowl 49 to claim the NFL championship for the 2014-2015 season. The NFL commissioner believed that this had a major beneficial effect on the US economy and requested a study of the impact on stock prices. His researchers collected weekly data on stock prices for a set of US firms from February 2014 to January 2016. Meanwhile, because they were worried about the potential for aggregate time trends, the researchers collected data from a set of Chinese firms included in the SSE Index over the same period to use as a control group.
On insendi you will find a dataset labeled patriots.csv. In it you will find four variables:
- exchange: A string variable indicating whether the stock is listed on the Shanghai Stock Exchange (sse) or the New York Stock Exchange (nyse)
- date: Date
- stock_id: A variable assigning a unique id number for each stock
- price: The closing price of the stock in question on the listed date
(a) In R, perform a difference-in-difference analysis that compares US and Chinese stock prices, before and after February 1st, 2015. In particular, let $D_i$ be a dummy variable that is equal to 1 for US stocks and 0 for Chinese stocks. Let $T_t$ be a dummy variable that is equal to 1 for any day after February 1st, 2015 and 0 otherwise. Let $y_{it}$ represent the share price. Estimate the following regression model:
$$y_{it} = \beta_0 + \beta_1 (D_i \times T_t) + \beta_2 D_i + \beta_3 T_t + v_{it}$$
Report the values of $\hat{\beta}_0^{OLS}$, $\hat{\beta}_1^{OLS}$, $\hat{\beta}_2^{OLS}$, and $\hat{\beta}_3^{OLS}$. (5 marks)
(b) If the necessary assumptions were true, which of these coefficients would represent the causal effect of a Patriots win on US stock prices? What would your estimated parameter suggest about this effect? (5 marks)
(c) Create a plot of the average stock prices in (i) the US and (ii) China for every week between February 2014 and January 2016.
Looking at the plot, do you believe the answer you suggested in part (b) is an accurate representation of the impact of the Patriots' Super Bowl win on US stock prices? Explain why or why not with explicit reference to the assumptions necessary for a valid difference-in-difference approach. (5 marks)

Problem 6 (15 marks)
Disclaimer: this question uses techniques introduced in week 4, so you may want to wait before beginning.
On insendi you will find two data files: regularization_train.csv and regularization_test.csv. Each contains a single y variable and 50 x variables labeled x1, x2, ..., x50.
(a) Set the seed using set.seed(1234). Use cross validation (with 10 folds) to choose $\lambda$ for a LASSO regression of y on all x1, x2, ..., x50. Report the following: (5 marks)
1. The value of $\lambda$ that minimizes the cross-validated error.
2. The largest value of $\lambda$ that provides cross-validated error within one standard error of the minimum.
3. The number of non-zero coefficients in the model estimated with $\lambda$ from part 2.
4. The mean squared error from an out of sample test of the model using the testing data regularization_test.csv (again using $\lambda$ from part 2).
(b) Repeat the exercise in part (a) using 20 fold cross validation. (5 marks)
(c) Repeat the exercise in part (a) using elastic-net with $\alpha = 0.1$. (5 marks)

Problem 7 (25 marks)
Create a single compelling plot of your choice using financial data and ggplot. You must download or otherwise acquire some financial or financially relevant data. This may be from Bloomberg, an online provider (e.g. Yahoo Finance or fred.stlouisfed.org), or any other source. Your plot should demonstrate an interesting stylized fact. This could be, for example, the short run price response of an asset or group of assets to some salient event, long run changes in investor flows, or anything that you find exciting. Please be creative and try to identify something that I or your classmates might find interesting.
Each student must perform this task individually. You are responsible for acquiring data yourself. You may not use data that has been downloaded by a classmate and you may not produce the same plot as a classmate.
Your answer to this question should cover a single page. The top half of the page will contain the plot, the bottom half will contain an explanation of the stylized fact. Feel free to supplement with regression analysis or other support from the data. Your graph will be judged on clarity. Pay attention to labeling and scaling. I will select a few interesting plots to present and discuss in class (with permission of the authors).
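To illustrate what attention to labeling and scaling can look like in ggplot, here is a minimal sketch. It assumes the ggplot2 package is installed and uses the economics dataset that ships with ggplot2 purely as a stand-in; that dataset would not satisfy the requirement to acquire your own data. The object name p and the particular theme and axis breaks are illustrative choices.

```r
library(ggplot2)

# 'economics' ships with ggplot2 (monthly US series); it is a stand-in only,
# since the assignment requires data acquired by the student.
p <- ggplot(economics, aes(x = date, y = psavert)) +
  geom_line() +
  # One tick per decade, labelled with the four-digit year
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  # Explicit title, axis labels with units, and a data source caption
  labs(
    title = "US personal savings rate",
    x = "Date",
    y = "Personal savings rate (%)",
    caption = "Data: 'economics' dataset bundled with ggplot2"
  ) +
  theme_minimal()

print(p)
```

The same structure (explicit axis labels with units, readable date breaks, and a source caption) should transfer to whatever data you acquire for your own plot.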