Type: IndividualProblem OverviewNobel Prize week has just ended, so this assignment will have a nice tie-in to both what weve been talking about in lecture and to these prestigious prizes in science. The 1978 Nobel Prize in physiology and medicine was awarded to Werner Arber, Dan Nathans, and Hamilton Smith for their discovery, study, and application of restriction enzymes. Quoting from the press release1:Restriction enzymes provide the chemical knives to cut genes (= DNA) into defined fragments.These may then be used (1) to determine the order of genes on chromosomes, (2) to analyse the chemical structure of genes and of regions of DNA which regulate the function of genes, and (3) to create new combinations of genes. These techniques open up new avenues to study the organisation and expression of genes of higher animals and to solve basic problems in developmental biology. In medicine, increased knowledge in this area should help in theprevention and treatment of malformations, hereditary diseases and cancer.This assignment will simulate the use of restriction enzymes to cut a DNA strand (DNA cleavage) and splice in new genetic material to form recombinant DNA. The science behind this is fascinating but mostly irrelevant to the assignment itself. None of us have to understand the science in order to complete the assignment. The main points of the assignment are (1) to provide experience in developing a software solution in the context of a real problem, (2) to provide experience in developing programs that build linked lists, and (3) to provide insight into how linked lists can offer efficiency advantages over arrays in certain circumstances.The simulation that were implementing is a simplification of one part of the chemical process known as molecular cloning. Weve provided files for you to work with: DnaStrand.java This is an interface that describes the abstract behavior necessary to simulateDNA being cleaved by a restriction enzyme and having new genetic material spliced in. ArrayStrand.java This is a class that implements DnaStrand and uses an array (a String, specifically)to store the DNA information (the nucleotides). This class is completely implemented foryou and provides a reference for the expected behavior. LinkedStrand.java Only the shell of this class is provided for you, and it is the implementationof this class that comprises most of the work for the assignment. DNASimulation.java This is a class that simulates the repeated cleaving and splicing of DNAstrands using restriction enzymes. Its output displays various statistics about the simulation to helpyou gauge the efficiency of the underlying DnaStrand implementations.1http://www.nobelprize.org/nobel_prizes/medicine/laureates/1978/press.html
COMP 2210 A5report This is a plain text file that contains the shell of a lab report for this assignment. You must provide good answers to each of the questions/prompts provided. Make sure you keep this a plain text file. Using Markdown2 is fine (it you want), but keep it text. ecoli.dat This is a plain text file that contains the DNA sequence3 for a strain of Escherichia coli4(E. coli) bacteria (4,639,201 base pairs). ecolimed.dat This is a plain text file that contains a subset of the DNA sequence for a strain of E.coli (320,101 base pairs).To complete the assignment, you must do the following things.(1) Benchmark the array-based implementation. You will use the DNASimulation class to benchmarkthe ArrayStrand implementation of DnaStrand, and document this in A5report. Specifically:a. Show that ArrayStrand.cutAndSplice() is O(N) where N is the size of the recombined strandreturned by the method. Document your work in the appropriate section of A5report. You can useeither ecoli.dat or ecolimed.dat for the simulation.b. Determine the largest power-of-two sized splice (the string spliced into the DNA strand) thatthe simulation supports without generating a java.lang.OutOfMemoryError when run using the-Xmx512M heap size request. Document your work in the appropriate section of A5report. Youmust use ecoli.dat with the restriction enzyme EcoRI (gaattc) for the simulation, and the lengthof the splice string must be a power of two.c. Double the size of heap memory available to the simulation (i.e., -Xmx1024M). With twice as muchmemory available, can your machine support the large splice that it couldnt handle before? If so,how long did it take to produce this next-larger splice? Document your work in the appropriatesection of A5report. You must use ecoli.dat with the restriction enzyme EcoRI (gaattc) for thesimulation, and the length of the splice string must be a power of two.d. Keep doubling the size of heap memory until your machine can no longer support the request.What splice sizes and running times do you observe for each heap doubling? Document your workin the appropriate section of A5report. You must use ecoli.dat with the restriction enzyme EcoRI(gaattc) for the simulation, and the length of the splice string must be a power of two.(2) Implement LinkedStrand. After completing all the analyses in step 1, you must design, code, and testthe LinkedStrand class. A description of the approach to the implementation is given a later section,and you must adhere to this description.(3) Empirically Verify LinkedStrands Efficiency. Once you have implemented LinkedStrand correctly,you must empirically verify that the LinkedStrand implementation is more efficient than ArrayStrand,and document this in the appropriate section of A5report. Specifically:a. Show that the time complexity of LinkedStrand.cutAndSplice() is independent of N where Nis the size of the recombined strand returned by the method. That is, you will show that thenew implementation is O(1) with respect to the size of the recombinant DNA. Document your workin the appropriate section of A5report. You can use either ecoli.dat or ecolimed.dat for thesimulation.2https://help.github.com/articles/markdown-basics/3http://www.genome.wisc.edu/sequencing.htm4http://en.wikipedia.org/wiki/Escherichia_coliPage 2 of 7COMP 2210 Fall 2014b. Determine the largest power-of-two sized splice (the string spliced into the DNA strand) thatthe simulation supports without generating a java.lang.OutOfMemoryError when run using the-Xmx512M heap size request. Document your work in the appropriate section of A5report. Youmust use ecoli.dat with the restriction enzyme EcoRI (gaattc) for the simulation, and the lengthof the splice string must be a power of two.c. Double the size of heap memory available to the simulation (i.e., -Xmx1024M). With twice as muchmemory available, can your machine support the large splice that it couldnt handle before? If so,how long did it take to produce this next-larger splice? Document your work in the appropriatesection of A5report. You must use ecoli.dat with the restriction enzyme EcoRI (gaattc) for thesimulation, and the length of the splice string must be a power of two.d. Keep doubling the size of heap memory until your machine can no longer support the request.What splice sizes and running times do you observe for each heap doubling? Document your workin the appropriate section of A5report. You must use ecoli.dat with the restriction enzyme EcoRI(gaattc) for the simulation, and the length of the splice string must be a power of two.Restriction Enzyme CleavingWell need a little background information on DNA and restriction enzyme cleaving to understand thecode youll be writing, modifying, and analyzing.DNA is formed by two biopolymer strands coiled around each other in the shape of a double helix.A strand of DNA is composed of nucleotides, which in turn are composed (in part) of four protein bases guanine (G), adenine (A), thymine (T), and cytosine (C). As a simplification, we will model DNA ashaving only one strand instead of two, and we will represent this single strand as simply a sequence ofbases, like this: AGTCGAATTCAAGTCGAATTCAGTCA.Enzymes5 are biological catalysts responsible for many biochemical processes. Restriction enzymes6are enzymes that cut a DNA strand at particular locations. An example restriction enzyme is EcoRI7:GAATTC. A restriction enzymes locates each occurrence of its pattern in the DNA strand and cuts thestrand into two pieces at each of those points (the binding sites or restriction sites), leaving either blunt orsticky ends. Other chemical processes can be used to bind genetic material to these blunt or sticky endsand produce recombinant DNA. As a simplification, we will ignore the distinction of blunt v. sticky ends,and we will model the cuts and splices as simply string deletion/replacement.There are two distinct cut-related DnaStrand methods: cutWith(enzyme) and cutAndSplice(enzyme,splice). The cutWith method cuts the strand of DNA at the first binding site of the specified enzyme.This method replaces its strand with the sequence that precedes the binding site and returns the sequenceafter the binding site. The cutAndSplice method cuts the strand of DNA at every binding site of thespecified enzyme, replacing the enzyme sequence with the specified splice sequence. This method leavesthe original DNA strand unchanged and returns the new recombinant DNA.As examples of these two operations, consider the DNA sequence shown in Figure 1 and the restrictionenzyme shown in Figure 2. Note that there are two occurrences of the restriction enzyme (i.e., two bindingsites) in the DNA, as shown in Figure 3.A G T C G A A T T C A A G T C G A A T T C A G T C AFigure 1: A sample DNA strand.5http://en.wikipedia.org/wiki/Enzyme6http://en.wikipedia.org/wiki/Restriction_enzyme7http://en.wikipedia.org/wiki/EcoRIPage 3 of 7COMP 2210 Fall 2014G A A T T CFigure 2: The restriction enzyme EcoRIA G T C G A A T T C A A G T C G A A T T C A G T C AFigure 3: A DNA strand with EcoRI binding sites highlighted.Let dna be the DNA strand in Figure 1, let enzyme be the restriction enzyme in Figure 2, and let splicebe the DNA strand XXYY. The behavior of the two cut methods is as follows: The effect on dna of dna.cutWith(enzyme) is shown in Figure 4. The return value of dna.cutWith(enzyme) is shown in Figure 5. The effect on dna of dna.cutAndSplice(enzyme, splice) is shown in Figure 1; i.e., no effect. The return value of dna.cutAndSplice(enzyme, splice) is shown in Figure 6.A G T CFigure 4: Effect on dna of dna.cutWith(enzyme).A A G T C G A A T T C A G T C AFigure 5: Return value of dna.cutWith(enzyme).A G T C X X Y Y A A G T C X X Y Y A G T C AFigure 6: Return value of dna.cutAndSplice(enzyme, splice).An Array-based ImplementationGiven the discussion above, an array-based implementation of DnaStrand is an intuitive and natural fit.Specifically, the DNA strand information could be stored as an array of strings, and we can rely on methodsfrom class String and StringBuilder to implement the DnaStrand methods. In this implementation,Figures 1 through 6 look identical to our underlying representations (minus the indexes).The class ArrayStrand is a full implementation of the DnaStrand interface using strings as the underlyingphysical representation. Your first step in the assignment will be to study this implementation.Make sure you understand the implementation of each DnaStrand method, and make sure you understandeach String and StringBuilder method that is used. jGRASP viewers should be very helpfulhere. For example, Figure 7 shows the source code for the cutAndSplice method, and Figure 8 shows ajGRASP viewer canvas depicting the execution of this method during a debug session.Although this array-based implementation may be intuitive, it is inefficient with respect to both timeand memory usage. For example, the operation that physically splices in new DNA (recombined.append(splice)in Figure 7) is O(N) for a splice of length N. Thus the cost of creating a recombinant strand, that is, thetime complexity of cutAndSplice with an N-length splice with B binding sites is B N, which is O(N).Page 4 of 7COMP 2210 Fall 2014public DnaStrand cutAndSplice(String enzyme, String splice){String toSplit = +dnaInfo+ ;String[] all = toSplit.split(enzyme);if (all[0].trim().equals(dnaInfo)){return EMPTY_STRAND;}StringBuilder recombined = new StringBuilder(all[0]);for (int k = 1; k < all.length; k++) {recombined.append(splice);recombined.append(all[k]);}return new ArrayStrand(recombined.toString().trim());}Figure 7: Source code for ArrayStrand.cutAndSplice.Figure 8: A jGRASP viewer canvas during the execution of ArrayStrand.cutAndSplice.The amount of memory required is also linear in the size of the splice. Memory on the order ofB N + M is required to splice an N-length strand into a M-length strand with B binding sites. Thismore than the time complexity significantly limits the scalability of the array-based solution.Alternate Implementation: Linked ListsWe can make a significant improvement in efficiency by basing our implementation on a linked list ratherthan a single array/string. A linked node representation will enable cutAndSplice and append to beimplemented by adding nodes to a linked list rather than copying arrays.You must develop LinkedStrand as an alternate implementation of DnaStrand based on linked nodes,subject to the following items.Page 5 of 7COMP 2210 Fall 2014 The node class has been written for you as a nested class in LinkedStrand. Do not make anychanges to the Node class. The LinkedStrand class must maintain two fields, front and rear that reference the first and lastnodes respectively of the linked list. These fields have been declared for you as public. Do notchange the visibility of these fields. They must be public. An empty LinkedStrand object has front and rear set to null. This is embodied in the parameterlesscontructor provided for you. Do not change this constructor. A non-empty LinkedStrand object starts out with exactly one node, which front and rear bothreference. For example, the effect of new LinkedStrand(AGTCGAATTCCATTG) is shown in Figure 9. The size method must be O(1). The cutWith method must be O(1). The cutAndSplice method must be O(1) with respect to the size of the splice, but may be linearwith respect to the number of binding cites of the enzyme. The cutAndSplice method must not call the String.split() method like the ArrayStrand implementationdoes. Instead, it should call the String.indexOf(str, fromIndex) method and one ormore of the String.substring() methods. The cutAndSplice method must not create a larger string, append to a string, or use a StringBuilderas the ArrayStrand implementation does. Instead, it must create and link new nodes to model splicingmaterial into several binding sites. For example, if dna is the LinkedStrand shown in Figure 9,the the return value of dna.cutAndSplice(GAATTC, XXYY) is shown in Figure 10. The cutWith and cutAndSplice methods should only operate on newly initialized strands; thatis, those containing a single node. These methods throw an IllegalStateException if the strandcontains multiple nodes. The append(DnaStrand) and append(String) methods must be O(1).A G T C G A A T T C C A T T G frontrearFigure 9: A newly initialized LinkedStrandA G T C X X Y Y C A T T G front rearFigure 10: A LinkedStrand after a cutAndSplice operation.Page 6 of 7COMP 2210 Fall 2014Assignment SubmissionYou must turn in both LinkedStrand.java and A5report to Web-CAT for grading. While not required,it is strongly recommended that you also submit your own test cases. Note the following rules regardingyour Web-CAT submissions: The feedback hints for failed test cases will be unavailable beginning one day prior to the assignmentdue date. You can submit to Web-CAT no more than ten times for this assignment. The last submission that you make to Web-CAT will be used to determine your grade on the assignment,even if its score is lower than that of an earlier submission. Submissions made within the 24 hour period after the published deadline will be assessed a latepenalty of 15 points. No submissions will be accepted more than 24 hours after the published deadline.AcknowledgmentThis assignment was adapted from an idea and assignment originally developed by Owen Astrachan.Page 7 of 7
Reviews
There are no reviews yet.