The project asks you to develop, evaluate, and compare models that predict whether proteins interact with DNA and RNA, using a provided dataset. Your model must classify a given protein sequence into one of four outcomes: interacts with DNA (DNA), interacts with RNA (RNA), interacts with both DNA and RNA (DRNA), or does not interact with DNA or RNA (nonDRNA). Although each group will solve the same task, the corresponding designs should be unique, i.e., collaboration between groups is not allowed.
Datasets
Two datasets are/will be provided:
- A training dataset (txt file) that includes 391 DNA proteins, 523 RNA proteins, 22 DRNA proteins, and 7859 nonDRNA proteins, for a total of 8795 proteins.
- A blind test dataset (txt file) that includes 8795 proteins, with similar proportions between the four classes of proteins. This is an independent test set, which means that the entire design procedure (including feature generation, feature selection, parameterization and selection of classifiers, etc.) must be completed using only the training dataset. The test dataset should be used to evaluate your system only once. This dataset will be posted on the class web site 2 days before the project submission deadline and will not include the annotation of the outcomes. You will have to predict the outcomes, and the instructor will process and assess these predictions.
The training dataset is provided in the comma-separated format where each protein is represented by:
- the amino acid sequence
- the class, encoded as DNA, RNA, DRNA, or nonDRNA
The test dataset will be in the same format as the training dataset, except that the outcomes will not be provided.
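For quick prototyping outside RapidMiner, a minimal sketch of loading the training data in Python is shown below. The file name is a placeholder (the actual file names come with the posted datasets), and the two-column comma-separated layout is assumed from the description above.

```python
import pandas as pd

# Placeholder file name; substitute the actual training file posted for the project.
TRAIN_FILE = "training.txt"

# Each row is expected to hold an amino acid sequence and its class label
# (DNA, RNA, DRNA, or nonDRNA), separated by a comma, with no header row.
train = pd.read_csv(TRAIN_FILE, header=None, names=["sequence", "label"])

# Sanity check: the label counts should roughly match the class sizes listed above.
print(train["label"].value_counts())
```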
Evaluation of Predictions
You are required to perform 5-fold cross validation when using the training dataset. This cross validation divides the training dataset into 5 random, equal-size subsets, where one subset is used to test the prediction model and the remaining four are used to train/develop the prediction model; this is repeated 5 times, each time using a different subset as the test set. Consequently, this test results in a prediction for every sequence in the training dataset. This test procedure is supported by RapidMiner.
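If you also prototype outside RapidMiner, a minimal sketch of the same 5-fold protocol in Python with scikit-learn is shown below. `X` and `y` are placeholders for your feature matrix and class labels, the decision tree is only an example classifier, and the stratified splitting (which keeps the rare DRNA class represented in every fold) is an assumption on top of the plain random splitting described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cross_validate(X, y, n_splits=5, random_state=0):
    """Predict every training protein exactly once via 5-fold cross validation."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    predictions = np.empty(len(y), dtype=object)
    for train_idx, test_idx in skf.split(X, y):
        # Example classifier only; swap in whichever algorithm your design uses.
        model = DecisionTreeClassifier(random_state=random_state)
        model.fit(X[train_idx], y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    return predictions  # one predicted label per protein in the training dataset
```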
For each of the four outcomes you will convert the dataset into a binary problem, i.e., a given outcome (positive outcome) vs. all other outcomes (negative outcomes). For example, all proteins that are labeled as DNA will be considered as positive, and the remaining proteins (RNA, DRNA and nonDRNA) as negative. Next, for each of the four outcomes you will compute the following measures:
Sensitivity = SENS = 100*TP / (TP + FN)
Specificity = SPEC = 100*TN / (TN + FP)
PredictiveACC = 100* (TP+TN) / (TP+FP+TN+FN)
MCC = (TP*TN - FP*FN) / sqrt[(TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)]
where TP is the number of true positives (correctly predicted positive outcomes), FP denotes false positives (negative outcomes that were predicted as positives), TN denotes true negatives (correctly predicted negative outcomes), FN stands for false negatives (positive outcomes that were predicted as negatives). You will also compute:
averageMCC = (MCCDNA + MCCRNA + MCCDRNA + MCCnonDRNA) / 4
accuracy = 100*TPall / (number of all proteins in the dataset)
where MCCDNA, MCCRNA, MCCDRNA, and MCCnonDRNA denote the MCC values when using the DNA, RNA, DRNA, and nonDRNA outcomes as the positives, and TPall is the number of correctly predicted outcomes (DNA proteins predicted as DNA proteins, RNA proteins predicted as RNA proteins, etc.). These measures can be computed based on the confusion matrix. You should round the values to one digit after the decimal point when reporting the accuracy, sensitivities, and specificities, and to three digits after the decimal point when reporting MCC. Your report must include the confusion matrix for your final/best solution.
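A sketch of how these measures could be computed from the true and predicted labels is given below, treating each outcome in turn as the positive class; the function names are illustrative, not part of the assignment, and the guard against a zero MCC denominator is an added assumption.

```python
import math

CLASSES = ["DNA", "RNA", "DRNA", "nonDRNA"]

def binary_counts(y_true, y_pred, positive):
    """TP, FP, TN, FN for one outcome treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def class_measures(y_true, y_pred, positive):
    """SENS, SPEC, PredictiveACC, and MCC for one outcome vs. all others."""
    tp, fp, tn, fn = binary_counts(y_true, y_pred, positive)
    sens = 100.0 * tp / (tp + fn)
    spec = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return sens, spec, acc, mcc

def overall_measures(y_true, y_pred):
    """averageMCC over the four outcomes and the overall multi-class accuracy."""
    mccs = [class_measures(y_true, y_pred, c)[3] for c in CLASSES]
    average_mcc = sum(mccs) / len(CLASSES)
    accuracy = 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return average_mcc, accuracy
```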
You must also provide and summarize predictions on the blind test dataset. To do that, you will compute your model using the entire training dataset (using the same design, i.e., features, values of parameters, etc., as in your best 5-fold cross validation result) and you will use this model to predict sequences from the blind test dataset. In your report, you must discuss the corresponding results on both the training and blind test datasets; for the blind test dataset you can summarize your results by reporting and comparing how many proteins were predicted with each outcome.
Design
You need to design your predictive model to maximize its predictive performance, evaluated based on averageMCC using 5-fold cross validation on the training dataset. The design may consider:
- Use of different features to encode the input protein sequence. The data mining algorithms require a rectangular dataset with a fixed size and structure of the feature vector for each object (protein). Thus, you will need to convert the input protein sequences (which have variable length) into a fixed set of (numerical) features. Lecture set 7 includes a few suggestions; a minimal example of one such encoding is sketched after this list.
- Selection of a subset of the input features. This could potentially speed up computation of the model, remove weak/noisy features, and reduce overfitting. Feel free to combine results of multiple feature selection methods.
- Selection of the classification algorithm that you will use to compute your model from among many algorithms that are available in RapidMiner.
- Parametrization of the selected classification algorithm(s). This involves setting values of their key parameters.
- Building a system with multiple models that are used together. For instance, you could use multiple models that predict all 4 classes and combine their results to generate one prediction. Check the methods in RapidMiner under Operators > Modeling > Predictive > Ensembles.
- Different ways to perform the prediction. There are at least two alternatives: use one model to predict all 4 classes vs. use 4 models to predict each of the four classes. In the latter case, you will have to combine the four results to select one best result for each protein. The advantage of the second approach is that you can choose different subsets of features and different classification algorithms and their parameters for each outcome/class.
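As an illustration of the first bullet above (converting variable-length sequences into fixed-length numerical features), a minimal sketch of amino acid composition features follows; this is only one of many possible encodings (Lecture set 7 lists others), and restricting the alphabet to the 20 standard amino acids is a simplifying assumption.

```python
# Standard 20 amino acids; real sequences may also contain rare letters (e.g., X, U, B),
# which this simple sketch ignores.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence):
    """Fraction of each amino acid in the sequence -> fixed-length vector of 20 numbers."""
    sequence = sequence.upper()
    length = max(len(sequence), 1)  # guard against an empty sequence
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]

# Example: every protein, regardless of its length, becomes a 20-dimensional feature vector.
features = composition_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```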
NOTE 1: Ensure that you perform all design activities (e.g., feature selection, selection and parametrization of the classification algorithms, etc.) using the 5-fold cross validation on the training dataset. Otherwise you could overfit this dataset and your results on the test dataset could suffer.
NOTE 2: Your design should be done incrementally. Start with a simple initial solution (complete the entire design, prediction, and prediction assessment process) and gradually make your design more sophisticated with the objective of improving the predictive performance. In your report, you should clearly indicate one best set of results, which must be selected based on the cross validation results on the training dataset. Moreover, these results should be compared with your intermediate results (earlier/simpler designs, other alternatives, etc.) and with the baseline results shown in Table 1, in order to justify your design choices. In your write-up, report your results by adding them into Table 1; this will make it easy to compare the different alternatives. Clearly indicate which result is the best/final. You should explain how you made the decisions that led you in a certain direction when redesigning your model. You should also provide a convincing argument for why and how your method is good/competitive in comparison to the baseline result in Table 1.
Table 1. Predictive results based on the 5-fold cross validation on the training dataset (this table is available on Blackboard).
| Outcome | Quality measure | Baseline result | Design 1 | Design 2 | Design 3 | Best Design |
| DNA | Sensitivity | 6.9 | | | | |
| | Specificity | 99.3 | | | | |
| | PredictiveACC | 95.2 | | | | |
| | MCC | 0.132 | | | | |
| RNA | Sensitivity | 39.6 | | | | |
| | Specificity | 98.9 | | | | |
| | PredictiveACC | 95.3 | | | | |
| | MCC | 0.501 | | | | |
| DRNA | Sensitivity | 4.5 | | | | |
| | Specificity | 100.0 | | | | |
| | PredictiveACC | 99.7 | | | | |
| | MCC | 0.122 | | | | |
| nonDRNA | Sensitivity | 98.6 | | | | |
| | Specificity | 29.8 | | | | |
| | PredictiveACC | 91.3 | | | | |
| | MCC | 0.428 | | | | |
| | averageMCC | 0.265 | | | | |
| | accuracy | 90.8 | | | | |
3. Presentation
- 8 minutes long, plus 2 minutes for a question-and-answer session
- shall describe the design, results, and conclusions
- shall include the following parts:
- Motivation for your design. Briefly explain how you arrived at your final design.
- Description of your design. Explain (preferably with a diagram) how your method makes the predictions.
- Discussion and comparison of the quality of the achieved best results using the results on the training dataset and Table 1.
- Conclusions. This part is essential; see the conclusions part of your report.
4. Statement of contributions
- A short document with a bullet-point-style list of detailed contributions to the project for each team member. The contributions cover all aspects of the project, including conceptualization and design of the methodology, implementation, testing, writing the report, preparing the presentation, making the presentation, coordination of the work, note taking, etc.
- The contribution list for each team member should be accompanied by an estimated fraction of the total project effort, quantified in %. The effort estimates across the 5 team members must sum up to 100%. Each team should strive to balance the effort to be 20% for each team member.
- This statement will be used to distribute the project grade among the team members.
Marking
The evaluation of the project report and predictions constitutes 15% of the final mark for the course and will consist of the following three parts:
- 30% for the quality of the report
- 20% for the quality of the design of the prediction method
- 50% for the quality of the predictions measured using the 5-fold cross validation on the training dataset and on the blind test set.
NOTE 3: For the third part, averageMCC is the main predictive quality measure that will be used to evaluate submitted solutions, but the conclusions must discuss the other quality indices as well. Bonuses of 15%, 10%, and 5% will be given to the project submissions that secure the highest, second highest, and third highest value of averageMCC on the blind test dataset. In case of a tie, the winner will be decided based on the higher value of the accuracy on the blind test dataset.
NOTE 4: MCCs that are high(er) relative to other submissions or to the baseline result in Table 1 are not necessary to receive a full mark. The most important aspect is to show substantial progress from your initial solution; you should show and discuss how your best design improves on your own alternative solutions and explain its advantages compared to the baseline results in Table 1.
The presentation constitutes 10% of the final mark from the course and will be evaluated by the instructor, TA and your peers. The grade will consist of three parts:
- Grade assigned by the fellow students (30%). Each project group will complete a short evaluation form (see Appendix A) to assess the presentations of the other groups. The instructor will gather and process these grades; they will be kept confidential. You should reassess and potentially revise your scores after all presentations on a given date are completed, to ensure consistency.
- TA's grade (30%). The TA will grade the quality of the presentations using Appendix B.
- Instructor's grade (40%). The instructor will grade the quality of the presentations using Appendix C. The presentation mark, broken into the marks from peers, TA, and instructor and including comments, will be sent by email to the group leader before the final exam.
Appendix A. Peer Evaluation Form for Project Presentations
Name of the presenting group: ____________________
Remarks:
- For each question, enter a grade between 0 and 20 or between 0 and 5 (0 being the worst, 20 or 5 being the best). Optionally, please add comments (both positive and negative); they will be passed along to the presenting group.
- The average of these grades across all groups will be used to compute the 30% peer-evaluation component.
| Criterion | remarks | grade |
| Quality of Presentation: Did you find the presentation interesting? Were the presenters prepared? Did you understand the topics covered in the presentation? How much did you learn? Was there anything significant missing? Were the conclusions and discussion of results covered sufficiently? How would you rate the handling of the discussion/questions? | | min 0, max 20 |
| Presentation Style: Was the presentation finished on time? Too fast/slow? Well presented? Was the presenter just reading the slides, or presenting material beyond the content of the slides? Was there eye contact? | | min 0, max 5 |
| Quality of Slides: Did you find the slides too crowded? Too brief? Too many? Easy to read? Was the layout of individual slides appropriate and consistent? How was the overall quality of the organization, in terms of the order and flow of the slides? | | min 0, max 5 |
| Additional Comments | | |
Appendix B. TA's Evaluation Form for Project Presentations
Date (circle the correct date): December 3, 2019 or December 5, 2019
Name of the presenting group: ____________________
| TASK | grade | max grade |
| Quality of Motivation for the proposed design | | 4 |
| Quality of the Description of the proposed design | | 4 |
| Quality of the Discussion and comparison of the quality of the achieved results | | 4 |
| Quality of Conclusions | | 8 |
| Quality of the Presentation and Presentation Style | | 10 |
| TA's total mark | | 30 |
Appendix C. Instructor's Evaluation Form for Project Presentations
Date (circle the correct date): December 3, 2019 or December 5, 2019
Name of the presenting group: ____________________
| TASK | comments | grade | max grade |
| Submission of presentation on time | (up to -4 points penalty) | | Y / 0 |
| Presentation finished on time | (up to -4 points penalty) | | Y / 0 |
| Quality of Motivation for the proposed design | | | 5 |
| Quality of the Description of the proposed design | | | 5 |
| Quality of the Discussion and comparison of the quality of the achieved results | | | 5 |
| Quality of Conclusions | | | 10 |
| Quality of the Presentation and Presentation Style | | | 15 |
| Instructor's total mark | | | 40 |