Fall 2024 STAT 5140
Final Project
Dataset
• (Preferred) Option 1: Find your own data set of interest.
If you have difficulty finding a data set of your own, consider one of the following two options.
• Option 2: This dataset was collected in a Mayo Clinic clinical trial on primary biliary cirrhosis (PBC) conducted between 1974 and 1984. It can be downloaded from http://lib.stat.cmu.edu/datasets/pbc. Please read the data description and variable definitions on that page carefully.
• Option 3: The SEER datasets. Follow the instructions at https://seer.cancer.gov/data/ to extract the SEER datasets. After extraction, the data files are in the incidence directory. The same directory contains the file seerdic.pdf, which explains the meaning of the fields in the data files. The SEER database is extremely large, and you will probably want to subset it. The American Cancer Society website (www.cancer.org) has a great deal of information about cancer that can help you choose a meaningful project. Indeed, the statistics on that site are obtained through analysis of the SEER database.
Project Content
You are to use the methods of this course to analyze your selected data set. The goal of the project is to demonstrate how much survival analysis you have learned. Projects should have the following components:
• Some preliminary elementary calculations (Chapters 4, 5, and 7)
• Some type of regression modeling using both a Cox proportional hazards model (Chapters 8-9) and a parametric regression model (Chapter 12). The results of these models should be compared.
• Model validation using residual analysis (Chapter 11 and Section 12.5) for both the proportional hazards and parametric regression models.
• Both categorical and numerical variables, as well as some interaction terms, should be considered for inclusion in the model.
• An interpretative section outlining your important conclusions in nontechnical language, e.g., “Being married decreases the risk of breast cancer by 40%” instead of “The coefficient of marital status is -0.51 and is significant.” Note that log(0.6) ≈ -0.51. Ideally, you will have some significant interaction terms, both categorical*categorical and categorical*numerical, to interpret.
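The conversion behind the marital-status example can be sketched in a few lines. This is only an illustration of the arithmetic, not a required part of the project: the function name percent_change_in_hazard and the interaction coefficient 0.20 are hypothetical, and only the value -0.51 comes from the example above.

```python
import math


def percent_change_in_hazard(coefficient: float) -> float:
    """Convert a log-hazard coefficient into a percent change in hazard."""
    hazard_ratio = math.exp(coefficient)
    return (hazard_ratio - 1.0) * 100.0


# Coefficient of marital status taken from the example sentence above.
beta_married = -0.51
hazard_ratio = math.exp(beta_married)
change = percent_change_in_hazard(beta_married)

print(f"hazard ratio = {hazard_ratio:.2f}")          # prints "hazard ratio = 0.60"
print(f"percent change in hazard = {change:.0f}%")   # prints "percent change in hazard = -40%"

# Hypothetical interaction coefficient (NOT from the assignment), e.g.
# married * some binary group indicator. For subjects in that group the
# marital-status effect is the sum of the main effect and the interaction.
beta_interaction = 0.20
hazard_ratio_in_group = math.exp(beta_married + beta_interaction)
change_in_group = percent_change_in_hazard(beta_married + beta_interaction)
```

When an interaction term is present, the main-effect coefficient alone describes only the reference group, so the nontechnical statement should be made separately for each group.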
Final Report Requirements
The final report should focus on introducing the data, the questions to be answered, the analysis techniques, and the interpretation of the results. The following format is suggested.
The final report should be no more than 15 pages from Section 1 to Section 5, excluding code. Code should be attached as an appendix. Submit the final report to Canvas in PDF format.
• Section 1: Introduction to data.
• Section 2: Questions to address. Formulate appropriate questions on your own that can be addressed by analyzing this dataset.
• Section 3: Statistical analysis. A technical section describing in great detail what you did. Relevant portions of R/SAS output should be ‘cut and pasted’ into this section. This section should be self-contained and not require rummaging through mounds of output. In particular, I should not have to read the appendix (see below) to know what you did and to assess the correctness of your analysis.
• Section 4: Interpret the analysis results and answer the question(s) in Section 2.
• Section 5: Discussion of the limitations of the analysis, including the pros and cons of the methods used.
• Appendix: Code.
• If you choose Option 3, attach a copy of your SEER confidentiality agreement showing your name and number in the upper left-hand corner.