MATLAB assignment: predicting the biodegradability of chemicals from QSAR [1] data
description
It is important to know whether a particular chemical is readily biodegradable, to prevent its accumulation and spread in the environment [1]. Some chemicals may be damaging to animals, plants etc. and many countries have introduced legislation to protect the environment. Conducting experiments to determine which chemicals are readily biodegradable is expensive, so the question naturally arises: can we predict biodegradability by some other means? QSAR is a modelling technique used in chemoinformatics that represents important aspects of chemistry. The data for this assignment provide QSAR information and the experimentally known biodegradability for each of 1055 chemicals. You need know nothing about chemistry or QSAR to conduct this task.
The data are provided in the matrix D in the mat-file qsar.mat [2]. The first 10 columns correspond to the results of QSAR analysis of each chemical [3] and the final column to the target (1 = biodegradable; 0 = not biodegradable). Remember: zero is a number.
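As a minimal sketch of the column layout described above, the following assumes that qsar.mat is on the MATLAB path and contains the matrix D. The names X and t are illustrative and not part of the brief. If the file is not found, a small random stand-in with the same layout is substituted so that the snippet still runs; the stand-in is not real QSAR data.

```matlab
% Load the data matrix D (assumed: qsar.mat is on the MATLAB path).
if exist('qsar.mat', 'file') == 2
    S = load('qsar.mat');      % S.D is the matrix described in the brief
    D = S.D;
else
    % Placeholder only, NOT real data: random stand-in with the same layout.
    rng(0);
    D = [randn(20, 10), double(rand(20, 1) > 0.5)];
end

X = D(:, 1:end-1);             % QSAR inputs (all columns except the last)
t = D(:, end);                 % target: 1 = biodegradable, 0 = not

% Sanity checks: the target must be binary, and zero is a valid class label.
assert(all(t == 0 | t == 1), 'Target column should contain only 0 and 1.');
fprintf('%d samples, %d inputs, %d positive, %d negative\n', ...
    size(X, 1), size(X, 2), sum(t == 1), sum(t == 0));
```

Printing the class counts early is a cheap way to see whether the two classes are balanced, which affects the choice of benchmark and performance indicators.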
task
The task is to build a fully-validated, predictive model for biodegradability using whatever model you prefer, according to currently accepted best practice. Use any of the models we have looked at and developed in class, or something else (if you are feeling confident), but it must be in MATLAB. The use of automatic model-building tools such as the MATLAB in-built neural network fitting tool is NOT permitted: I need to see your model-building strategy.
1. Prepare a report describing clearly what you have done and why, summarizing your results in a convincing way and justifying any choices made. You should include performance indicators [4] and plots of the kind we have used in the labs. Things you might consider include: what kind of data normalization is needed, if any, and why? Why did you choose the particular structure/model/approach? How does your result compare to a simple benchmark? How did you control complexity? What was your strategy for validation and final testing? How do you know your final model is acceptable? What would you recommend to your client or boss? Analyse and discuss the graphs. Critical evaluation and going beyond what was done in the lab classes will attract higher scores: there are plenty of choices, such as LASSO, PCA or something else you've researched for yourself. The reader should be able to repeat your experiments unambiguously from your description. Do NOT restate the task description: it wastes space.
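One possible way to organise validation and final testing, sketched here under assumptions rather than prescribed by the brief, is a stratified hold-out test set plus K-fold cross-validation on the remainder. The labels below are random placeholders, and the values of testFrac and K are arbitrary choices for illustration.

```matlab
% Hypothetical stand-in targets; in practice use the last column of D.
rng(1);                               % fix the seed so the split is repeatable
t = double(rand(100, 1) > 0.66);      % placeholder labels, NOT real data
testFrac = 0.2;                       % assumed fraction held out for final testing
K = 5;                                % assumed number of cross-validation folds

isTest = false(size(t));              % true for held-out test samples
fold   = zeros(size(t));              % fold index 1..K for the other samples

for c = [0 1]                         % stratify: handle each class separately
    idx = find(t == c);
    idx = idx(randperm(numel(idx)));  % shuffle within the class
    nTest = round(testFrac * numel(idx));
    isTest(idx(1:nTest)) = true;      % set aside for final testing only
    rest = idx(nTest+1:end);
    fold(rest) = mod(0:numel(rest)-1, K)' + 1;   % deal the rest into K folds
end

for k = 1:K
    trainIdx = ~isTest & fold ~= k;   % fit candidate models on these samples
    valIdx   = ~isTest & fold == k;   % compare candidates on these samples
    % Fit and score a candidate model here; the test set stays untouched.
end
```

Fixing the random seed and stating it in the report is one way to let a reader repeat the split exactly.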
Study the marking criteria before you begin.
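Several of the questions above (normalization, a simple benchmark, performance indicators) can be illustrated with base MATLAB alone. The sketch below uses random placeholder data, z-scores the inputs with training-set statistics, uses a majority-class predictor as one possible benchmark, and derives a few common indicators from the confusion-matrix counts. All variable names are illustrative, and none of these choices is mandated by the brief.

```matlab
% Placeholder data, NOT real results: two inputs and a binary target.
rng(2);
Xtrain = randn(80, 2) * 3 + 5;   ttrain = double(rand(80, 1) > 0.6);
Xtest  = randn(20, 2) * 3 + 5;   ttest  = double(rand(20, 1) > 0.6);

% z-score using training statistics only, so no test information leaks in.
mu = mean(Xtrain, 1);
sigma = std(Xtrain, 0, 1);
sigma(sigma == 0) = 1;                       % guard against constant columns
Ztrain = (Xtrain - mu) ./ sigma;             % implicit expansion (R2016b or later)
Ztest  = (Xtest  - mu) ./ sigma;

% Simple benchmark: always predict the majority class of the training set.
majority = double(mean(ttrain) >= 0.5);
ypred = repmat(majority, size(ttest));

% Confusion-matrix counts and indicators derived from them.
TP = sum(ypred == 1 & ttest == 1);
TN = sum(ypred == 0 & ttest == 0);
FP = sum(ypred == 1 & ttest == 0);
FN = sum(ypred == 0 & ttest == 1);
accuracy    = (TP + TN) / numel(ttest);
sensitivity = TP / max(TP + FN, 1);          % true-positive rate
specificity = TN / max(TN + FP, 1);          % true-negative rate
balancedAcc = (sensitivity + specificity) / 2;
fprintf('Benchmark: acc %.2f, sens %.2f, spec %.2f, balanced acc %.2f\n', ...
    accuracy, sensitivity, specificity, balancedAcc);
```

A majority-class predictor scores a balanced accuracy of 0.5 whenever both classes appear in the test set, which is why raw accuracy alone can flatter a model on imbalanced data.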
2. Prepare a MATLAB script that loads the data, runs your code and presents your results. The results should be in a clear and easily comprehended form with adequate documentation in the form of comments. Your code will be run by the examiner as part of the assessment. You must also save the final versions of any models produced.
If your code doesn't run you will lose marks, so test it carefully.
See below for details of how to save your work.
[1] Mansouri et al. (2013) Quantitative Structure-Activity Relationship Models for Ready Biodegradability of Chemicals, J Chemical Information & Modelling, 53, 867-878.
[2] Download from MOLE.
[3] These have been reduced from the original 41 by PCA.
[4] You are free to choose any performance indicators that you think appropriate but you must explain why.
rules
In addition to your ability to carry out the data modelling task, this assignment is testing your professional engineering skills: working to a specification, providing tested, commented, operational code, providing what the client has asked for and justifying your method.
report: you are permitted a maximum of four A4 sides of 11 point type with 25 mm margins. If you exceed the limit [5] you will be penalized. You must save your document as a PDF file only; no other format is acceptable. You will be penalized and may be asked to re-submit (once only) if you do not adhere to this.
Your registration number must appear at the top of each page. Do not include your name.
code: you must save the script and all additional functions that are needed for it to run [6] (use zip [7] format if more than one).
Do not include dmmilab.
If the code does not run, you will be penalized and may be asked to re-submit (once only). You can assume that I have the full, current version of MATLAB [8] and dmmilab installed: it is your responsibility to test your code.
models: you must save your final models in a MAT file; they should be called myglm, mymlp, myrbf, my... etc., and be saved along with x_star & z_star,
e.g. save dm123456789 myglm mymlp x_star z_star
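The save command above can also be written in function form, which makes it easy to check afterwards what the MAT file actually contains. In this sketch the variable contents are arbitrary placeholders (the brief does not define them here), and the file is written to the temporary folder purely so that the example leaves no clutter; a real submission would use its own registration number and working folder.

```matlab
% Placeholder variables standing in for real models and data (hypothetical).
myglm  = struct('w', [0.1; -0.2]);   % e.g. a struct holding fitted weights
x_star = [1 2; 3 4];                 % arbitrary values for illustration
z_star = [0; 1];                     % arbitrary values for illustration

% Function form of: save dm123456789 myglm x_star z_star
matFile = fullfile(tempdir, 'dm123456789.mat');
save(matFile, 'myglm', 'x_star', 'z_star');

% Load into a struct (not the workspace) to list what was written.
saved = load(matFile);
disp(fieldnames(saved));
```

Loading into a struct avoids overwriting workspace variables while confirming that every required name is present in the file.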
files: you must use your nine-digit registration number to name your report, MATLAB script and MAT files as per the following examples:
dm123456789.pdf, dm123456789.m & dm123456789.mat or
dm123456789.pdf, dm123456789.zip & dm123456789.mat
Failure to do so incurs a penalty and you may be asked to re-submit (once only).
submission: submit the three files via the MOLE assignment system.
penalties: see marking scheme.
[5] If the PDF reader indicates more than four pages, the remainder will be lost; don't waste space on cover sheets or unnecessary descriptions of the problem domain.
[6] This includes all additional m-files that are not part of the MATLAB installation or dmmilab (these must be either your own or be freely available).
[7] Other compression formats are not permitted.
[8] As provided on the university managed service.