The goal of this project is to write a program that mimics the process of protein synthesis in eukaryotic cells. The first half focuses on transcription and translation. The second half introduces the concept of mutation.

Background Information

All living organisms store their genetic information in chains of nucleic acid. All eukaryotes (i.e., organisms whose cells contain membrane-bound organelles, or "little organs") use deoxyribonucleic acid, or DNA, as the hard drive where information is stored. DNA is composed of four distinct nucleobases: adenine, thymine, cytosine, and guanine, which are abbreviated by their first letter (A, T, C, G). A chain of nucleobases forms a DNA strand. Although one strand is enough to store information, each eukaryotic cell contains two complementary copies that bind to each other to form a double helix. The rules of base pairing are as follows: A and T pair together, C and G pair together.

Fun Fact: The human genome contains roughly 2.9 billion base pairs. If unwound in a straight line, this would amount to about 2 m in length. Thanks to ingenious folding techniques, our cells are able to store DNA in their nucleus, which is only 6 microns across (1 micron is a millionth of a meter). As if this weren't impressive enough, remember that each cell contains two strands of DNA!

Task A: Transcription

Each gene codes for a protein, and transcription is the first step of gene expression. Most protein synthesis occurs in organelles known as ribosomes, which are located outside of the nucleus where DNA is stored. To relay information to a ribosome, the cell makes a copy of the relevant gene from DNA and sends that copy out of the nucleus. The copy is called a messenger ribonucleic acid, or mRNA. mRNA is made of the same nucleobases as DNA, except for one: it does not contain thymine [T], but instead contains uracil [U].
That means that the complement of [A] in mRNA is [U]. As such, the rules of complementation in mRNA are as follows:

- [A] becomes [U]
- [T] becomes [A]
- [C] becomes [G]
- [G] becomes [C]

Your task is to write a program called transcriptase.cpp that reads a text file called dna.txt that contains one DNA strand per line, which looks as follows:

    AAGATGCCGATGCCGTAAGATGCGGTAAGATGCCCGTAAGATGCCGTA
    ...

and outputs to the console (terminal) the corresponding mRNA strands. Each output line must contain exactly one mRNA strand. This is a sample output of the program:

    $ ./transcriptase
    UUCUACGGCUACGGCAUUCUACGCCAUUCUACGGGCAUUCUACGGCAU
    ...

Recall that to read from a file, the following code snippet can be used:

    // Requires <iostream>, <fstream>, <string>, and <cstdlib> (for exit);
    // assumes "using namespace std;".
    ifstream fin("dna.txt");
    if (fin.fail()) {
        cerr << "File cannot be read, opened, or does not exist.\n";
        exit(1);
    }
    string strand;
    while (getline(fin, strand)) {
        cout << strand << endl;
    }
    fin.close();

The best way to do this is in two steps. First create a function that gives the complement of a base, and then write another function that applies it to every base of a strand. For example, we could have char DNAbase_to_mRNAbase(char) to return the complement of a base and string DNA_to_mRNA(string) that uses it for each base in the strand. Note that the output must be in capital letters, regardless of how the input is formatted. To do this, you may include <cctype> and use int toupper(int c), which returns the upper case of any alphabetic character passed to it.

Task B: Translation

While a nucleotide is the basic unit of information, a group of three nucleotides, called a codon, is the basic unit of storage. The reason for this is that each gene codes for a protein, and all proteins are made from 20 amino acids. Recall that there are 4 different bases that make up DNA. Thus, three bases can encode 4x4x4 = 64 different symbols. Two bases can only encode 4x4 = 16 symbols, which is not enough.

For this task, you will need the following dictionary file: codons.tsv. It contains 64 lines, each with two columns. The first column holds the codons, and the second holds the corresponding amino acids.

Your task is to write a program called translatase.cpp that, given strands of DNA (taken from dna2b.txt), outputs to the console the corresponding amino-acid chain. Feel free to use your code from Task A to convert the DNA into mRNA to match the codons in the dictionary. Notice that there are 4 special codons: Met, which stands for Methionine, and 3 Stop codons. Methionine is the first amino acid in every protein chain and as such serves as the Start codon. This means that translation does not begin until the AUG codon, which encodes methionine, has been read. The three Stop codons, UAA, UGA, and UAG, are not included in the protein chain and simply signify the end of translation.
The rules of formatting are as follows:

- Use the three-letter abbreviation from the dictionary for each amino acid
- Insert a hyphen after each amino acid except for the last
- The first amino acid should always be Met
- Stop codons should not be inserted

e.g. tacaacact would produce Met-Leu.

For this task, you will need to have two ifstream objects open: one for the DNA file, and one for the codon dictionary file. The same code segment from Task A can be adapted to read dna2b.txt since we only read it once. However, for each codon in each DNA strand, we need to perform a dictionary lookup. It would not be very efficient to open, read, and close the file each time, because repeated file access becomes expensive and slow in the long run. The better alternative is to open the file once with one ifstream object, pass it by reference, and reset the file pointer to the beginning for each lookup. This can be done with seekg(0). Below is an example that shows how to read from a file that has two fields per line where the delimiter is a space. Because operator>> splits on any whitespace, the same loop also works for tab-separated fields. You can modify this code to perform a lookup in codons.tsv.

    void dictionary_read(ifstream &dict) {
        string key, value;
        dict.clear();  // reset error state
        dict.seekg(0); // return file pointer to the beginning
        while (dict >> key >> value) {
            cout << "key: " << key << endl;
            cout << "value: " << value << endl;
        }
    }

N.B. To make this task a bit easier, the lengths of the DNA strands are multiples of 3, and the strands must be read as such. This means that you do not need to scan a strand one base at a time until the first AUG. Rather, scan it three bases at a time from the beginning of the strand, and start translation at the first AUG encountered in this manner.

Background Information: Mutations

Many factors, such as environmental conditions, random chance, and errors in handling, can result in a change, or mutation, in the DNA sequence. These changes can range from benign to catastrophic depending on their effects.
There are four kinds of mutations.

- Substitution: a base pair is replaced with another (e.g. aac -> agc).
- Insertion: a new base pair is added somewhere in the DNA strand.
- Deletion: a base pair is removed from the sequence.
- Frameshift: any of the above mutations, or a combination thereof, that causes codons to be parsed incorrectly.

Task C: Substitution and Hamming Distance

For this task, we will explore mutations that occur by substitution. Your task is to write a program called hamming.cpp that calculates the Hamming distance between two strings. Given two strings of equal length, the Hamming distance is the number of positions at which the two strings differ, e.g. Hamming("aactgc", "atcaga") would output 3. Notice that certain amino acids are encoded by multiple codons. Therefore, not all substitutions result in a change of protein structure.

The file mutations.txt contains an even number of lines (zero-indexed). The even-numbered lines contain the original DNA sequence, and the odd-numbered lines contain that same sequence with substitution mutations. For each pair in mutations.txt, output to the console the Hamming distance followed by yes or no, indicating whether the substitution caused a change in structure. Example:

    $ ./hamming
    0 no
    17 yes
    4 yes

Remember that translation to proteins does not begin until the first Start codon and stops at the first Stop codon. Unlike the Start codon, the Stop codon is not included in the protein chain; it simply signifies the end of translation.

Task D: Insertion, Deletion, and Frameshift

The worst type of mutation is the frameshift mutation, as it causes the DNA sequence to be parsed incorrectly. This is often created by a deletion or insertion that causes the sequence to be read in a different multiple of three. This abnormal reading often results in an earlier or later Stop codon, which causes the protein to be abnormally short or long, thus rendering it nonfunctional.

So far, the DNA sequences have had lengths that are multiples of 3 and could be read three bases at a time from the beginning.
The file frameshift_mutations.txt contains the same DNA sequences as Task B on the even lines, with frameshift mutations on the odd lines (0-indexed). Each mutation has at most one insertion or one deletion. Your task is to write a program called frameshift.cpp that compares the results of Task B with the mutated strands. To do this, you will need to parse the strands one nucleotide at a time, as the Start codon is not guaranteed to begin at a multiple of 3 from the beginning. Your output should be the original protein on the even lines, and the mutated protein on the odd lines. Example:

    $ ./frameshift
    Met-Thr-Pro-Tyr-Val-Val
    Met-Thr-Pro
    Met-Gly-Gly-Leu-Tyr
    Met-Gly-Gly-Leu-Tyr
    Met-Gly-Thr-Ala-Ala-Asp-Pro-Arg-Arg-Gly
    Met-Gly-Thr-Ala-Ala-Asp-Ala-Lys-Ala-Gly-Leu