UNIT CODE: ACFIM0005
UNIT NAME: Quantitative Methods, Big Data, and Machine Learning
October 2024
Overview
• The coursework represents 40% of the final mark for the unit.
• The coursework takes the form of a report. The word limit for the written part of the assignment is 3,000 words (excluding tables, references, and appendices). Please note that this is a word limit, not a target. Output from Python, including charts or tables, can be pasted into the report.
• The code must be attached in the Appendix, ideally with explanatory comments.
• The coursework is group work. You need to arrange yourselves into groups of 3 or 4 people (groups smaller than 3 or larger than 4 are not permitted). All members of a given group will receive the same mark, and it is up to you to determine the allocation of work within the group and to ensure that all group members make a valid contribution. All members of a group should be happy with the whole submission, as you assume joint responsibility.
• Penalties will apply if the coursework is submitted late.
Coursework requirement
1. The following questions are based on the data used in “Empirical Asset Pricing via Machine Learning” by Shihao Gu, Bryan Kelly, and Dacheng Xiu (Review of Financial Studies, Vol. 33, Issue 5, (2020), 2223-2273), henceforth GKX. You can download the GKX data, which contains 94 firm characteristics, directly from Dacheng Xiu’s website: https://dachxiu.chicagobooth.edu/. Merge your data with the monthly stock return file on Blackboard using the PERMNO and date columns. Please include your code in the appendix with comments.
a. Randomly select 500 stocks from the dataset and define your own sample period. Extract their market betas and construct the one-month-ahead return. Cross-sectionally rank those stocks into decile portfolios based on their market betas in every month. Report the average one-month-ahead return of each beta-sorted portfolio. Are those returns statistically significantly different from zero? [5]
[Hint 1: You can find the variable definitions in the internet appendix of the paper, which is also available on Dacheng Xiu’s website.]
[Hint 2: Make sure the random selection of stocks is reproducible.]
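The steps in 1.a can be sketched as follows. This is only an illustration on simulated data: the column names (permno, date, beta, ret), the seed value, and the choice of equal-weighted portfolios are assumptions, not requirements of the brief or the actual layout of the GKX file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged GKX/return panel. The column names
# (permno, date, beta, ret) are assumptions, not the real file layout.
rng = np.random.default_rng(0)
dates = pd.period_range("2000-01", periods=24, freq="M")
panel = pd.MultiIndex.from_product(
    [range(10001, 10801), dates], names=["permno", "date"]
).to_frame(index=False)
panel["beta"] = rng.normal(1.0, 0.5, len(panel))
panel["ret"] = rng.normal(0.01, 0.08, len(panel))

# 1) Reproducible draw: sort the universe, then sample with a fixed seed.
universe = np.sort(panel["permno"].unique())
chosen = np.random.default_rng(42).choice(universe, size=500, replace=False)
sample = panel[panel["permno"].isin(chosen)].sort_values(["permno", "date"]).copy()

# 2) One-month-ahead return: next row of the same stock.
#    This assumes each stock has consecutive months with no gaps.
sample["ret_lead1"] = sample.groupby("permno")["ret"].shift(-1)
sample = sample.dropna(subset=["beta", "ret_lead1"])

# 3) Cross-sectional deciles within each month (1 = lowest beta).
sample["decile"] = sample.groupby("date")["beta"].transform(
    lambda s: pd.qcut(s, 10, labels=False) + 1
)

# 4) Equal-weighted portfolio return per month (an assumption),
#    then the time-series mean and a simple t-statistic against zero.
port = sample.groupby(["date", "decile"])["ret_lead1"].mean().unstack("decile")
mean = port.mean()
t_stat = mean / (port.std(ddof=1) / np.sqrt(port.count()))
summary = pd.DataFrame({"mean": mean, "t_stat": t_stat})
print(summary.round(2))
```

Sorting the universe before sampling means the draw does not depend on the row order of the input file, so the same seed gives the same 500 stocks on every run.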
b. Find the market index return and risk-free rate for the U.S. market. Calculate the excess return for each beta-sorted portfolio, perform an OLS regression against the market excess return, and obtain the full-sample beta coefficient. What do you observe?
i. Does a low-beta portfolio report a low beta coefficient, whereas a high-beta portfolio reports a high beta coefficient? What about their t-values and R-squared?
ii. Does a low-beta portfolio on average earn lower returns, whereas a high-beta portfolio earns higher returns?
Please analyse the statistics. [5]
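For 1.b, a minimal sketch of the time-series regression for each portfolio is given below. It uses simulated series, and the helper name ols_with_stats is hypothetical. The t-values rely on classical (homoskedastic) standard errors, which is a simplifying assumption; a robust alternative may be more appropriate for real return data.

```python
import numpy as np

def ols_with_stats(y, x):
    """OLS of y on a constant and x: coefficients, t-values, R-squared."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)                      # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))  # classical std. errors
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, coef / se, r2

# Simulated market excess return and ten portfolios with known betas
# (all numbers here are made up for illustration).
rng = np.random.default_rng(1)
mkt_excess = rng.normal(0.006, 0.045, 240)
true_betas = np.linspace(0.5, 1.5, 10)

results = []
for b in true_betas:
    port_excess = 0.001 + b * mkt_excess + rng.normal(0, 0.02, 240)
    coef, tvals, r2 = ols_with_stats(port_excess, mkt_excess)
    results.append((coef[0], coef[1], tvals[1], r2))  # alpha, beta, t(beta), R2
results = np.array(results)
print(np.round(results, 2))
```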
c. Plot all the portfolios’ full-sample average excess returns against their beta coefficients in a graph. Fit a line using ordinary least squares. What do you observe? [10]
i. Is the line positively sloped, negatively sloped, or flat?
ii. Is the intercept significantly positive, significantly negative, or insignificantly different from zero? Please use regressions to provide a statistical test.
d. Split your sample into two based on whether the market excess return is above or below the full-sample median. Repeat the 1.c exercise for each subsample and discuss your findings. [5]
e. Test for the validity of CAPM at the individual stock level. [10]
$$EXRet_{i,t+1} = \alpha + \beta \cdot BETA_{i,t} + \gamma \cdot x_{i,t} + \varepsilon_{i,t+1}$$
i. For each stock i in each month t, conduct a Fama-MacBeth (FM) cross-sectional average regression using the above equation and include additional stock (firm)-level control variables $x_{i,t}$. [Hint: you need to justify the inclusion of the selected control variables.]
ii. Repeat the exercise but split the sample into two based on whether the market excess return is above or below the full-sample median. What do you observe? Discuss your findings.
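The two-step Fama-MacBeth procedure in 1.e can be sketched as below: a cross-sectional OLS in each month, followed by the time-series average of the monthly coefficients with a t-statistic based on their time-series standard deviation. The function name, the column names, and the control variable "ctrl" are hypothetical, and the data are simulated. No autocorrelation adjustment is applied to the standard errors, which is a simplification.

```python
import numpy as np
import pandas as pd

def fama_macbeth(df, y_col, x_cols, date_col="date"):
    """Monthly cross-sectional OLS, then mean and t-stat of each coefficient."""
    rows = []
    for date, g in df.groupby(date_col):
        g = g.dropna(subset=[y_col] + x_cols)
        if len(g) <= len(x_cols) + 1:   # too few stocks to estimate this month
            continue
        X = np.column_stack([np.ones(len(g)), g[x_cols].to_numpy()])
        coef, *_ = np.linalg.lstsq(X, g[y_col].to_numpy(), rcond=None)
        rows.append(pd.Series(coef, index=["const"] + x_cols, name=date))
    monthly = pd.DataFrame(rows)        # one row of coefficients per month
    mean = monthly.mean()
    t_stat = mean / (monthly.std(ddof=1) / np.sqrt(len(monthly)))
    return pd.DataFrame({"coef": mean, "t_stat": t_stat}), monthly

# Simulated panel: 200 stocks, 60 months, one hypothetical control "ctrl".
rng = np.random.default_rng(2)
dates = pd.period_range("2010-01", periods=60, freq="M")
toy = pd.MultiIndex.from_product(
    [range(200), dates], names=["permno", "date"]
).to_frame(index=False)
toy["beta"] = rng.normal(1.0, 0.5, len(toy))
toy["ctrl"] = rng.normal(0.0, 1.0, len(toy))
toy["exret_lead1"] = (
    0.002 + 0.005 * toy["beta"] + 0.003 * toy["ctrl"]
    + rng.normal(0, 0.05, len(toy))
)

fm_table, monthly_coefs = fama_macbeth(toy, "exret_lead1", ["beta", "ctrl"])
print(fm_table.round(4))
```

For part ii, the same function could be applied separately to the rows whose month falls in each market-state subsample.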
2. Using the same dataset as in Q1, but now including all the stocks, your goal is to predict one-month-ahead returns by training different ML models on the large pool of 94 firm characteristics (20 of the characteristics in the GKX data have monthly frequency, while the rest are either quarterly or annual).
a. Choose all 20 of the monthly characteristics and add 10 other quarterly/annual characteristics of your choosing to obtain a list of 30 predictive features. Report the summary statistics for the features in your list and give a brief definition for each. [5]
b. Pre-process the predictive features by applying the rank normalization technique described in the paper (see section 2.1 footnote 29). [5]
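As a rough sketch of 2.b, the function below ranks each characteristic within every month and maps the ranks linearly onto the interval [-1, 1], which is how GKX describe their rank normalization. The function and column names are hypothetical. Setting missing values to 0 (the middle of the ranked range) is an assumption made here for illustration; check it against the paper before relying on it.

```python
import numpy as np
import pandas as pd

def rank_normalize(df, feature_cols, date_col="date"):
    """Cross-sectionally rank features each month and map ranks to [-1, 1]."""
    out = df.copy()
    grouped = out.groupby(date_col)[feature_cols]
    ranks = grouped.rank(method="average")   # 1..n within each month, NaN kept
    counts = grouped.transform("count")      # non-missing observations per month
    scaled = 2.0 * (ranks - 1.0) / (counts - 1.0) - 1.0
    # Assumption: missing values (and single-stock months) are set to 0.
    out[feature_cols] = scaled.fillna(0.0)
    return out

toy = pd.DataFrame({
    "date": ["2000-01"] * 5 + ["2000-02"] * 3,
    "x": [10.0, 20.0, 30.0, 40.0, 50.0, 3.0, np.nan, 1.0],
})
normalized = rank_normalize(toy, ["x"])
print(normalized)
```

Because the ranks are computed within each month only, the transformation uses no information from other dates.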
c. Train two different ML models, Partial Least Squares (PLS) and Random Forest (RF), to predict one-month-ahead returns. Your out-of-sample testing period should start in January 1990 (i.e., the first out-of-sample prediction should be for February 1990) and end in November 2021 (i.e., the last prediction is for December 2021). Be as explicit as possible when describing your training methodology. Make sure that there is no forward-looking bias (i.e., leakage). Choose appropriate metrics to measure the model fit and report both in-sample and out-of-sample performance. Compare your results across the two models. [30]
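One way to guard against leakage in 2.c is an expanding-window loop in which the model for test month t is fitted only on rows dated before t. The sketch below illustrates that idea together with the out-of-sample R-squared that GKX report, which compares forecasts against a zero forecast rather than the historical mean. A plain linear regression stands in for PLS or RF purely to keep the example short, the data are simulated, and all function and column names are hypothetical. Refitting every month is also an illustrative choice.

```python
import numpy as np
import pandas as pd

def r2_oos(y_true, y_pred):
    """Out-of-sample R-squared against a zero forecast (no demeaning)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)

def expanding_forecasts(df, feature_cols, target_col, test_dates, date_col="date"):
    """Refit at each test month using only rows dated strictly before it."""
    preds = []
    for t in test_dates:
        # Features dated before t; their one-month-ahead targets are known by t.
        train = df[df[date_col] < t]
        test = df[df[date_col] == t]
        X_tr = np.column_stack([np.ones(len(train)), train[feature_cols].to_numpy()])
        coef, *_ = np.linalg.lstsq(X_tr, train[target_col].to_numpy(), rcond=None)
        X_te = np.column_stack([np.ones(len(test)), test[feature_cols].to_numpy()])
        preds.append(pd.Series(X_te @ coef, index=test.index))
    return pd.concat(preds)

# Simulated panel: "date" is the feature month, "ret_lead1" the next month's return.
rng = np.random.default_rng(3)
dates = pd.period_range("2015-01", periods=48, freq="M")
toy = pd.MultiIndex.from_product(
    [range(100), dates], names=["permno", "date"]
).to_frame(index=False)
features = ["x1", "x2", "x3"]
for c in features:
    toy[c] = rng.normal(size=len(toy))
toy["ret_lead1"] = 0.03 * toy["x1"] + rng.normal(0, 0.05, len(toy))

test_dates = dates[-12:]
pred = expanding_forecasts(toy, features, "ret_lead1", test_dates)
oos = r2_oos(toy.loc[pred.index, "ret_lead1"], pred)
print(round(oos, 2))
```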
d. Compare the out-of-sample performance between 1990-1999 to the 2000-2021 performance. Is there a difference? [5]
e. Choose an appropriate method to measure the variable importance and report the variable importance results for two of your models. [5]
3. Continue to use the same dataset, but select only 3 stocks. Your goal now is to identify each focal stock’s potential leaders based on the following lead-lag models:
where, for each focal stock i in month t + 1, the candidate independent variables include every other stock j’s monthly return in month t, as well as stock i’s own monthly return in month t.
a. Construct 3 new panel datasets where the index contains month t and the columns contain stock i’s or j’s monthly return. Pre-process the predictive features and ensure there are no missing values. [5]
b. Choose an appropriate rolling window length and estimate the above equation using LASSO month by month. Your out-of-sample testing period should be every following month t + 2. [10]
[Hint: A longer regression period is likely to reduce noise. But an overly long window will prevent you from uncovering relatively short-lived leader-follower stock pairs.]
i. Please describe in detail your training methodology and be careful about look-ahead bias. Choose appropriate metrics to measure the model fit and report both in-sample and out-of-sample performance.
ii. How many stocks are identified as stock leaders in the cross section? How many of the identified stock leaders share the same two-digit SIC code (i.e., sic2) with the focal stock? How persistent (time-varying) are those identified stock leaders? Please provide tables/graphs to show your findings.
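A rolling LASSO for one focal stock might look like the sketch below. It assumes the lead-lag model regresses the focal stock's return in month t + 1 on the month-t returns of all stocks, as described in the text of Q3. The function name, the 36-month window, and the fixed penalty alpha are illustrative assumptions (in practice the penalty could be tuned inside each window), and the returns are simulated so that stock S1 leads the focal stock S0.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

def rolling_lasso_leaders(returns, focal, window=36, alpha=1e-4):
    """Rolling LASSO of the focal stock's return on all stocks' lagged returns."""
    lagged = returns.shift(1).iloc[1:]   # row for month s holds month s-1 returns
    target = returns[focal].iloc[1:]
    preds, selected = {}, {}
    for k in range(window, len(target)):
        # Train on the previous `window` months only; every target in the
        # window is realised before the month being predicted.
        X_tr, y_tr = lagged.iloc[k - window:k], target.iloc[k - window:k]
        model = Lasso(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
        month = target.index[k]
        preds[month] = model.predict(lagged.iloc[[k]])[0]
        selected[month] = pd.Series(model.coef_ != 0, index=returns.columns)
    return pd.Series(preds), pd.DataFrame(selected).T

# Simulated returns for 30 hypothetical stocks; S1 leads S0 by one month.
rng = np.random.default_rng(4)
months = pd.period_range("2010-01", periods=120, freq="M")
names = [f"S{j}" for j in range(30)]
rets = pd.DataFrame(rng.normal(0, 0.05, (120, 30)), index=months, columns=names)
rets["S0"] = 0.5 * rets["S1"].shift(1).fillna(0.0) + rng.normal(0, 0.01, 120)

preds, selected = rolling_lasso_leaders(rets, focal="S0")
# Share of windows in which each stock has a non-zero coefficient.
persistence = selected.mean().sort_values(ascending=False)
print(persistence.head())
```

The share of windows in which a stock receives a non-zero coefficient is one simple way to summarise how persistent an identified leader is.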
Marks will be awarded for:
Your work will be assessed on how well you have carried out the various parts of the coursework, as detailed below:
1. Appropriate construction of variables to use in the models.
2. Correct implementation of the econometric or machine learning tools.
3. Correctness, clarity, completeness and relevance of your interpretations and discussions for each question.
4. Presentation of the coursework, including a clear structure for the report, a consistent number of digits in tables (e.g., 2 decimals for all numbers throughout the report), detailed descriptions for tables/graphs, consistent reference styles, etc.
5. Your understanding of, and ability to interpret, the software-generated output in terms of the concepts and ideas discussed in the lectures and classes.
6. You will NOT be assessed in terms of how well your model happens to fit the data, or whether you find particular variables are significant.