[Solved] CSCI1300/1310-Assignment 4-Measuring DNA Similarity


Objectives:

  1. Apply the concept of program decomposition.
  2. Manipulate strings.
  3. Design functions in program construction that use pointers.

DNA is the hereditary material in humans and other species. Almost every cell in a person's body has the same DNA. All information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). A different order of these bases encodes different information.

One of the challenges in computational biology is determining what the codons in a DNA sequence represent. A codon is a sequence of three nucleotides that form a unit of genetic code in a DNA or RNA molecule. For example, given the sequence GGGA, the codon could be a GGG or a GGA, depending on where the gene begins and the active bases in the gene. Clues about how to interpret a DNA sequence can be found by comparing an unknown DNA sequence to a known sequence and measuring their similarity. If sequences are similar, then it can be hypothesized that they have similar functions and proteins.

Assignment Details:

In this assignment, you will develop a few functions for DNA analysis. These functions will calculate common measures of DNA similarity, such as the Hamming distance and the Best Match between two DNA sequences. Each of the DNA sequences you need for this assignment can be copied from this write-up and stored in a variable in your program. There is a sample DNA sequence for a mouse, human, and an unknown species. Your mission is to determine the identity of the unknown by comparing it to the human and the mouse. If the unknown species is more similar to the human than it is to the mouse, then you can conclude that the unknown sequence is from a human. Otherwise, you can conclude that the unknown is from a mouse.


Your assignment needs to include at least the following functions for full credit:

void calculateSimilarity(double *similarity, string DNA1, string DNA2);

string calculateBestMatch(double *bestscore, int *index, string DNA1, string DNA2);

What to submit?

The assignment is set up in Moodle. There are three code-runner questions. The first question is for the full code (directives, prototypes, definitions, and main) and tests it for correctness. The other two questions are for the functions calculateSimilarity and calculateBestMatch and make partial credit possible. It is your responsibility to copy these functions from the full code into the individual questions. If you need any helper functions, include them in the space provided, above the function that calls them. If you are not sure, please ask the instructors before the deadline. Don't forget that after the deadline, you have the right to request an interview with your TA to review your work.

Hamming distance and similarity between two strings

Hamming distance is one of the most common ways to measure the similarity between two strings of the same length. Hamming distance is a position-by-position comparison that counts the number of positions in which the corresponding characters in the string are different. Two strings with a small Hamming distance are more similar than two strings with a larger Hamming distance.

Example:

first string = ACCT

second string = ACCG

A C C T

| | | *

A C C G

In this example, there are three matching characters and one mismatch, so the Hamming distance is one.

The similarity score for two sequences is then calculated as follows:

similarity_score = (string length-hamming distance) / string length

similarity_score = (4-1)/4=3/4=0.75

Two sequences with a high similarity score are more similar than two sequences with a lower similarity score.
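The calculation above can be sketched in C++ as follows. This is only an illustration, not the required interface: the helper names hammingDistance and similarityScore are assumptions made for this example, and both functions return their results directly instead of using pointers. Scoring empty input as 0 is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count the positions at which two equal-length strings differ.
int hammingDistance(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());  // Hamming distance needs equal lengths
    int distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            ++distance;
        }
    }
    return distance;
}

// similarity_score = (string length - hamming distance) / string length
double similarityScore(const std::string& a, const std::string& b) {
    if (a.empty()) {
        return 0.0;  // assumption: empty input counts as no similarity
    }
    const double length = static_cast<double>(a.size());
    return (length - hammingDistance(a, b)) / length;
}
```

With the strings from the example, hammingDistance("ACCT", "ACCG") gives 1 and similarityScore("ACCT", "ACCG") gives 0.75.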

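The Best Match algorithm extends the Hamming distance calculation by finding the best overlap of two strings. Given a string to search and a shorter search string, calculate the Hamming distance between the search string and the equal-length substring that starts at each position of the string being searched. The best match is the substring with the smallest Hamming distance, and therefore the highest similarity score.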

calculateSimilarity(double*, string, string)

The calculateSimilarity() function should take three arguments: a pointer to a double, which stores the similarity between the strings, and the two strings to compare. You can declare a pointer to a double just as you would a pointer to an integer:

double x;

double *dPtr = &x;

The function should calculate the similarity score for the two strings and store that score in the variable the pointer refers to.

Note: when you test calculateSimilarity(), first pass in strings whose similarity you can calculate by hand, and only then pass in real data. That will help you identify errors in your algorithm.
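One possible shape for this function, using the prototype given in this write-up, is sketched below. The write-up does not say what to do with empty strings or strings of different lengths, so writing a score of 0 in those cases is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using namespace std;  // matches the prototypes used in this write-up

// Writes the similarity score of DNA1 and DNA2 into *similarity.
void calculateSimilarity(double *similarity, string DNA1, string DNA2) {
    assert(similarity != nullptr);  // the caller must pass a valid address

    // Assumption: empty or unequal-length inputs score 0.
    if (DNA1.empty() || DNA1.length() != DNA2.length()) {
        *similarity = 0.0;
        return;
    }

    // Hamming distance: count the positions where the characters differ.
    int hamming = 0;
    for (size_t i = 0; i < DNA1.length(); ++i) {
        if (DNA1[i] != DNA2[i]) {
            ++hamming;
        }
    }

    const double length = static_cast<double>(DNA1.length());
    *similarity = (length - hamming) / length;
}
```

A hand-checkable test is to declare double score; and call calculateSimilarity(&score, "ACCT", "ACCG"); which should leave 0.75 in score.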

calculateBestMatch(double*, int*, string, string)

The calculateBestMatch() function should take four arguments: a pointer to a double, a pointer to an integer, and two strings. The double pointer stores the best similarity score found, and the integer pointer stores the index in the searched string where the best match starts. The first string argument is the string you are searching, and the second string argument is the substring to search for. The function returns a string: the DNA sequence from the mouse or human DNA that best matches the user-entered sequence, that is, the one with the highest similarity score.

Note: you will need to be aware of the end of each string to make sure that you don't loop off the end of either string.
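A sliding-window version of this search might look like the sketch below. Several details are assumptions, because the write-up leaves them open: when no character matches, the sketch returns an empty string and sets the index to -1; when several windows tie, it keeps the earliest one; and a search string longer than the searched string is treated as having no match.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

using namespace std;  // matches the prototypes used in this write-up

// Searches DNA1 for the window that best matches DNA2.
string calculateBestMatch(double *bestscore, int *index, string DNA1, string DNA2) {
    assert(bestscore != nullptr && index != nullptr);
    *bestscore = 0.0;
    *index = -1;   // assumption: -1 signals that no match was found
    string best;   // assumption: empty string when nothing matches

    const size_t n = DNA1.length();
    const size_t m = DNA2.length();
    if (m == 0 || m > n) {
        return best;  // the search string cannot fit anywhere
    }

    // The last valid start is n - m, so the inner loop never reads
    // past the end of either string.
    for (size_t start = 0; start + m <= n; ++start) {
        int matches = 0;
        for (size_t j = 0; j < m; ++j) {
            if (DNA1[start + j] == DNA2[j]) {
                ++matches;
            }
        }
        // matches / m equals (m - hamming distance) / m
        const double score = static_cast<double>(matches) / m;
        if (score > *bestscore) {  // strict '>' keeps the earliest best window
            *bestscore = score;
            *index = static_cast<int>(start);
            best = DNA1.substr(start, m);
        }
    }
    return best;
}
```

For example, searching "ACCTGA" for "CTG" returns "CTG" with a score of 1 and an index of 2.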

Functionality in main()

In your main() function, you will need to call the other functions you have written. Use the mouse and human DNA samples shown below in this write-up; the unknown DNA sample is provided only for testing your program. Your first task is to ask the user to enter the unknown DNA sequence and store it in a variable. You should output the results of the function calls in main(). After calling calculateSimilarity(), you need to output the identity of the unknown DNA sequence.

if the unknownDNA is more similar to the humanDNA, print "Human"

else if the unknownDNA is more similar to the mouseDNA, print "Mouse"

else the unknownDNA is equally similar to both mouse and human, so print "Identity cannot be determined"
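The three-way decision above could be isolated in a small helper, as in this sketch. The function name identifySpecies is hypothetical and not part of the assignment. It takes the two similarity scores and returns the message to print. Comparing doubles directly is reasonable here on the assumption that both scores were computed over the same string length.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps the two similarity scores to the output message.
std::string identifySpecies(double humanScore, double mouseScore) {
    // Similarity scores are fractions of matching positions.
    assert(humanScore >= 0.0 && humanScore <= 1.0);
    assert(mouseScore >= 0.0 && mouseScore <= 1.0);

    if (humanScore > mouseScore) {
        return "Human";
    }
    if (mouseScore > humanScore) {
        return "Mouse";
    }
    return "Identity cannot be determined";  // the scores are equal
}
```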

Before calling calculateBestMatch(), you need to prompt the user for a search string. To compare the search string to the mouse DNA and the human DNA, you would do something like the following:

cout << "Enter a substring:";
getline(cin, subStr);

calculateBestMatch(&similarityscore, &index, mouseDNA, subStr);

calculateBestMatch(&similarityscore, &index, humanDNA, subStr);

After calling calculateBestMatch(), you need to display the DNA sequence that is the best match as well as the best similarity score. If no character matches, print "Match not found".

Here is the skeleton/high level code which you need to follow:

#include <iostream>
#include <string>

using namespace std;

// Declare humanDNA as a const string and initialize it with the value given in the write-up
// Declare mouseDNA as a const string and initialize it with the value given in the write-up

// Declare the function headers

int main()
{
    // Declare all the variables needed for your program

    cout << "Enter an unknownDNA:" << endl; // ask the user to enter the unknown DNA
    cin >> unknownDNA;

    // Call the calculateSimilarity function to compare unknownDNA with humanDNA
    // Call the calculateSimilarity function to compare unknownDNA with mouseDNA

    // If the unknownDNA is more similar to the humanDNA
    cout << "Human" << endl;
    // Else if the unknownDNA is more similar to the mouseDNA
    cout << "Mouse" << endl;
    // Else the unknownDNA is equally similar to both mouse and human
    cout << "Identity cannot be determined" << endl;

    cout << "Enter a substring:" << endl; // ask the user to enter a sequence
    cin >> subStr;

    // Pass the user-entered sequence with humanDNA and mouseDNA to the
    // calculateBestMatch function
    bestsequence = calculateBestMatch(&similarityscore, &index, humanDNA, subStr);

    // If a best match is found, print in the format below
    cout << "The Best Match found in humanDNA is " << bestsequence << " with a similarity score of " << similarityscore << endl;
    // else print
    cout << "Match not found" << endl;

    bestsequence = calculateBestMatch(&similarityscore, &index, mouseDNA, subStr);

    // If a best match is found, print in the format below
    cout << "The Best Match found in mouseDNA is " << bestsequence << " with a similarity score of " << similarityscore << endl;
    // else print
    cout << "Match not found" << endl;

    return 0;
}

Other helpful hints

Write the code without pointers first. Once you are confident that your algorithms are correct, modify your functions to use pointers.

DNA samples

Use the following human, mouse, and unknown DNA strings in your program.

humanDNA =

CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAG

CACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGAC

CTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTC

CACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCT

GACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAGCTGAGCACTGGAGTGGAGTTTTC

CTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATG

mouseDNA =

CGCAATTTTTACTTAATTCTTTTTCTTTTAATTCATATATTTTTAATATGTTTACTAT

TAATGGTTATCATTCACCATTTAACTATTTGTTATTTTGACGTCATTTTTTTCTATTTC

CTCTTTTTTCAATTCATGTTTATTTTCTGTATTTTTGTTAAGTTTTCACAAGTCTAATA

TAATTGTCCTTTGAGAGGTTATTTGGTCTATATTTTTTTTTCTTCATCTGTATTTTTAT

GATTTCATTTAATTGATTTTCATTGACAGGGTTCTGCTGTGTTCTGGATTGTATTTTTC

TTGTGGAGAGGAACTATTTCTTGAGTGGGATGTACCTTTGTTCTTG

unknownDNA =

CGCATTTTTGCCGGTTTTCCTTTGCTGTTTATTCATTTATTTTAAACGATATTTATAT

CATCGGGTTTCATTCACTATTTTTCTTTTCGATAAATTTTTGTCAGCATTTTCTTTTAC

CTCTTCTTTCTGTTTATGTTAATTTTCTGTTTCTTAACCCAGTCTTCTCGATTCTTATC

TACCGGACCTATTATAGGTCACAGGGTCTTGATGCTTTGGTTTTCATCTGCAAGAGTCT

GACTTCCTGCTAATGCTGTTCTGTGTCAGGGTGCATCTGAGCACTGATGTGGAGTTTTC

TTGTGGATATGAGCCATTCATAGTGTGGGATGTGCCATAGTTCATG

Additional practice problems:

If you are interested in additional problems for practice, here are a few more you can work on. We won't be grading them, but a little extra practice never hurt anyone.

Complement sequence function

DNA is double-stranded, even though we always write a single strand's nucleotide sequence. This is because we can infer the other strand's sequence from the given strand. Every A on one strand has a complementary T on the opposite strand, and every C has a complementary G. Therefore, we can create the complement strand by swapping each nucleotide with its complementary nucleotide. Create a function that takes a sequence and produces the complement sequence.
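A minimal sketch of such a function follows. The names complementBase and complement are assumptions made for this example, and the sketch assumes the input contains only the uppercase letters A, C, G, and T; anything else trips the assert in debug builds and is passed through unchanged otherwise.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the complementary base for A, C, G, or T.
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:
            assert(false && "expected A, C, G, or T");
            return base;  // reached only when asserts are disabled
    }
}

// Swap every base for its complement, keeping the original order.
std::string complement(const std::string& dna) {
    std::string result = dna;
    for (char& base : result) {
        base = complementBase(base);
    }
    return result;
}
```

For example, complement("ACCT") gives "TGGA".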

Reverse complement sequence function

You now have the complement strand, but DNA is read in the forward direction for the given strand and in the reverse direction for the complementary strand. Therefore, we must reverse the strand (the first character becomes the last, the second becomes second to last, and so on) to print it correctly. Create a function that produces the reverse complement of a given DNA sequence. With it, we can search both strands of the given DNA sequences using the functions created earlier.
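One way to sketch this is to build a reversed copy of the sequence and then complement each base in place. The names complementOf and reverseComplement are assumptions made for this example, and the input is again assumed to contain only uppercase A, C, G, and T.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the complementary base for A, C, G, or T.
char complementOf(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:
            assert(false && "expected A, C, G, or T");
            return base;  // reached only when asserts are disabled
    }
}

std::string reverseComplement(const std::string& dna) {
    // Reverse iterators copy the sequence back to front.
    std::string result(dna.rbegin(), dna.rend());
    for (char& base : result) {
        base = complementOf(base);
    }
    return result;
}
```

For example, reverseComplement("ACCT") gives "AGGT": the reversed sequence is "TCCA", and its complement is "AGGT".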
