[SOLVED] DNA chain Excel 2

What your programs MUST do:
You must adhere to the programming specification for this assignment in order
to receive full credit. You must also use command line options for these
programs, so pay particular attention to the example usage I have provided.
See solution #1 to assignment 2 for more on command line options.
Command line options automatically put checks in place. For example, when an
option should be an integer (i), that is checked automatically if you use:

'columnToParse=i' => \$COL,

But you must check whether $COL was passed in at all; the module will not do
that for you with the way I have implemented this in my solution #1.
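
As an illustration of the point above, here is a minimal sketch using Getopt::Long. The helper name parseColumnOption and its error messages are made up for this example and are not part of the assignment or of solution #1.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not part of the assignment): parse an argument
# list and return the value given to -columnToParse.
sub parseColumnOption {
    my (@args) = @_;
    my $COL;
    # '=i' makes Getopt::Long reject non-integer values for this option
    GetOptionsFromArray(\@args, 'columnToParse=i' => \$COL)
        or die "Bad command line options\n";
    # Getopt::Long does not complain when the option is simply absent,
    # so that check has to be done by hand
    die "Please provide -columnToParse\n" unless defined $COL;
    return $COL;
}

print parseColumnOption('-columnToParse', 3), "\n";
```

In a real script you would normally call GetOptions, which reads @ARGV; GetOptionsFromArray is used here only so the sketch can be run without command line arguments.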
The data file for this program can be found here (note that this file has
been gzipped, so make sure to gunzip it). Suppose we had a company identify
all of the amino acids for a group of unknown proteins using mass
spectrometry, and we also had them computationally determine the secondary
structure of those peptide chains. (De novo peptide sequencing for mass
spectrometry is typically performed without prior knowledge of the amino
acid sequence. It is the process of assigning amino acids from peptide
fragment masses of a protein.)

We had asked them to send us two files: one fasta file with the protein
sequences, and the other with the corresponding secondary structure data.
But as it turns out they sent us one file, Arghhhhhh!! You are to write a
program (pdbFastaSplitter.pl) which will open this file and generate two
files: one with the protein sequences (pdbProtein.fasta), and the other
with the corresponding secondary structures (pdbSS.fasta). Make sure to keep
the white space intact in pdbSS.fasta, because each position corresponds to
an amino acid in pdbProtein.fasta. At the end of the program, tell the user
how many sequences were found for each of the output files by printing this
to STDERR.
Here is an example of what the first two sequences look like. Gaps in the
secondary structure just mean there is no secondary structure annotation.

>101M:A:sequence
MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGA
ILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL
GYQG
>101M:A:secstr
HHHHHHHHHHHHHHGGGHHHHHHHHHHHHHHH GGGGGG TTTTT SHHHHHH HHHHHHHHHHHHHHHH
HHTTTTHHHHHHHHHHHHHTS HHHHHHHHHHHHHHHHHH GGG SHHHHHHHHHHHHHHHHHHHHHHHHT
T

http://155.33.203.128/teaching/BIOL6200-Spring2017/local/assignments/assign_arrays_solutions.html
http://155.33.203.128/teaching/BIOL6200-Spring2017/local/assignments/data_fasta/ss.txt.gz

Pseudocode for your program:

get a filehandle to the sequences in the fasta file (use the getFh
subroutine, see below).

open two outfiles named pdbProtein.fasta and pdbSS.fasta (use
the getFh subroutine, see below).

loop over the fasta filehandle and store the data in two arrays, one for the
header lines and the other for the sequence data (use
the getHeaderAndSequenceArrayRefs subroutine, see below).

process the two arrays: go over the header array, and if a header matches the
amino acid sequence, print the entry to pdbProtein.fasta; otherwise print it
to pdbSS.fasta.
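
The last pseudocode step could be sketched roughly as follows. The helper name splitRecords is hypothetical, and the sketch assumes that protein headers end in ":sequence" (as in the 101M example) and that every other header is a secondary-structure entry.

```perl
use strict;
use warnings;

# Hypothetical helper (not required by the assignment): route each record
# to one of two filehandles by its header and return both counts.
# ASSUMPTION: protein headers end in ':sequence'; all others are SS entries.
sub splitRecords {
    my ($refArrHeaders, $refArrSeqs, $fhProtein, $fhSS) = @_;
    my ($numProtein, $numSS) = (0, 0);
    for my $i (0 .. $#{$refArrHeaders}) {
        if ($refArrHeaders->[$i] =~ /:sequence\s*$/) {
            print {$fhProtein} "$refArrHeaders->[$i]\n$refArrSeqs->[$i]\n";
            $numProtein++;
        }
        else {
            # sequence data is printed untouched, so spaces are preserved
            print {$fhSS} "$refArrHeaders->[$i]\n$refArrSeqs->[$i]\n";
            $numSS++;
        }
    }
    return ($numProtein, $numSS);
}
```

The two returned counts are what the main program would then report on STDERR, for example with print STDERR "Found $numProtein protein sequences\n".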

Examples of the program being run (follow the same output):

$ perl pdbFastaSplitter.pl -infile ss.txt
Found X protein sequences
Found X ss sequences

# Comment only. Note: the X in the output above should be
# the actual number of sequences found

$ perl pdbFastaSplitter.pl -h
pdbFastaSplitter.pl [options]

Options:

 -infile   Provide the fasta sequence file name to do the splitting on,
           this file contains two entries for each sequence, one with the
           protein sequence data, and one with the SS information
 -help     Show this message

$ perl pdbFastaSplitter.pl
Dying Make sure to provide a file name of a sequence in FASTA format

pdbFastaSplitter.pl [options]

Options:

 -infile   Provide a fasta sequence file name to do the splitting on,
           this file contains two entries for each sequence, one with the
           protein sequence data, and one with the SS information
 -help     Show this message

$ perl pdbFastaSplitter.pl -infile ss_designed2Fail.txt
The size of the sequences and the header arrays is different
Are you sure the FASTA is in correct format at pdbFastaSplitter.pl line 115, <$fh> line 511.

# Note: the "at pdbFastaSplitter.pl line" part will show different line
# numbers when your own code dies. You can use the file ss_designed2Fail.txt
# (linked below) to test whether your program dies correctly.

You must implement the following subroutines (name them exactly as
instructed, provide the same arguments, and call them in the same context
as instructed; failure to do so will result in points being deducted):

Write a subroutine (call it getFh, called in scalar context) that receives
two arguments: 1) how to open the file, '>' or '<'; 2) a file name. The
purpose of this subroutine is to open the file name passed in, and it should
handle both reading and writing. It should test that either '<' or '>' has
been passed in and die if not. It should also test that the parameter sent
in really is a file when it is an infile call. I can call this subroutine
like this:

my $fhIn = getFh('<', $file2open);
or
my $fhOut = getFh('>', $file2write2);
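
One possible shape for getFh, as a minimal sketch. The error messages are invented for illustration, and the instructor's own solution may differ.

```perl
use strict;
use warnings;

# Open $file in the given mode ('<' to read, '>' to write) and return
# the filehandle. Dies on any other mode or on a missing infile.
sub getFh {
    my ($mode, $file) = @_;
    die "Mode must be '<' or '>'\n"
        unless defined $mode && ($mode eq '<' || $mode eq '>');
    die "No file name was passed in\n" unless defined $file;
    # only an infile has to exist already; an outfile is created by open
    die "$file is not a file\n" if $mode eq '<' && !-f $file;
    open(my $fh, $mode, $file) or die "Cannot open $file: $!\n";
    return $fh;
}
```

The three-argument form of open keeps the mode separate from the file name, which is what makes the '<' or '>' check straightforward.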

http://155.33.203.128/teaching/BIOL6200-Spring2017/local/assignments/data_fasta/ss_designed2Fail.txt.gz

Write a subroutine (call it getHeaderAndSequenceArrayRefs, called in list
context) that receives one argument: 1) a filehandle to the fasta file used
in this program. The subroutine will return two array references: one for
the headers in the file and one for the sequences in the file. There should
be a one-to-one correspondence between the two, meaning element 1 of the
header array should correspond to element 1 of the sequence array. If
implemented correctly, you can call this subroutine like this:

# send off the filehandle
# get back data in array references
my ($refArrHeaders, $refArrSeqs) = getHeaderAndSequenceArrayRefs($fhIn);

This subroutine should die if it could not successfully get two arrays
of equal size (see _checkSizeOfArrayRefs). Here _checkSizeOfArrayRefs does
the dying, but that gets carried through getHeaderAndSequenceArrayRefs,
which can then be tested in a unit test like:

# should die on this bad file
dies_ok { getHeaderAndSequenceArrayRefs($fhInBad) } 'Dies ok when the bad file is sent in';

Write a subroutine (call it _checkSizeOfArrayRefs, called in void context;
note the _ in front of the name) that receives two arguments: 1) a reference
to the header array built in getHeaderAndSequenceArrayRefs (directly above);
2) a reference to the sequence array built there. This is a helper function
that will be called in the getHeaderAndSequenceArrayRefs subroutine (which
is why its name starts with a _, since it's not really meant to be called in
the main part of your program; we'll discuss this more at a later time). If
the sizes of the arrays passed into this subroutine are not the same, it
should die (telling the user why it died); otherwise it just returns
(return;).

If implemented correctly, you can call this subroutine like this:

# check to make sure data looks good
_checkSizeOfArrayRefs(\@header, \@seq);
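
The two subroutines described above could be sketched together like this. This is one possible reading of the specification, not the reference solution. It assumes that every line starting with ">" is a header and that all following lines up to the next header belong to that entry.

```perl
use strict;
use warnings;

# Die unless both array references hold the same number of elements.
sub _checkSizeOfArrayRefs {
    my ($refArrHeaders, $refArrSeqs) = @_;
    if (@$refArrHeaders != @$refArrSeqs) {
        # no trailing newline, so Perl appends the "at ... line ..." part
        die "The size of the sequences and the header arrays is different\n",
            "Are you sure the FASTA is in correct format";
    }
    return;
}

# Read a FASTA filehandle; return (\@headers, \@seqs) with matching indices.
sub getHeaderAndSequenceArrayRefs {
    my ($fh) = @_;
    my (@headers, @seqs);
    my $seq = '';
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {
            # a new header closes the previous record, if it had any data
            push @seqs, $seq if length $seq;
            $seq = '';
            push @headers, $line;
        }
        else {
            # append as-is so spaces in secondary-structure lines survive
            $seq .= $line;
        }
    }
    push @seqs, $seq if length $seq;
    _checkSizeOfArrayRefs(\@headers, \@seqs);
    return (\@headers, \@seqs);
}
```

With this approach, a header that is not followed by any sequence data leaves the two arrays at different sizes, so the helper dies and the error propagates to the caller.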

In your second program (nucleotideStatisticsFromFasta.pl) you are to
open the following file (note that this file has been gzipped). Store the
data in two arrays as in the first program (pdbFastaSplitter.pl). Now for
each sequence I would like to know the number of A's, T's, G's, C's, and any
other type of nucleotide (call all of these N's). I'd also like to know the
length of the sequence and the %GC content of the entire sequence.

Pseudocode for your program:

get a filehandle to the sequences in the fasta file (use getFh subroutine).

open one outfile (influenzaStats.txt) (use getFh subroutine).

loop over the fasta filehandle and store the data in two arrays, one for the
header lines and the other for the sequence data
(use getHeaderAndSequenceArrayRefs subroutine, see below).

process the arrays, and determine the necessary output I want to see
(use printSequenceStats subroutine, see below).

Examples of the program being run:

$ perl nucleotideStatisticsFromFasta.pl -infile influenza.fasta -outfile influenza.stats.txt

$ perl nucleotideStatisticsFromFasta.pl -h
nucleotideStatisticsFromFasta.pl [options]

Options:

 -infile    Provide a fasta sequence file name to complete the stats on
 -outfile   Provide a output file to put the stats into
 -help      Show this message

$ perl nucleotideStatisticsFromFasta.pl
Dying Make sure to provide a file name of a sequence in FASTA format

nucleotideStatisticsFromFasta.pl [options]

Options:

 -infile    Provide a fasta sequence file name to complete the stats on
 -outfile   Provide a output file to put the stats into
 -help      Show this message

$ perl nucleotideStatisticsFromFasta.pl -infile influenza.fasta
Dying Make sure to provide an outfile name for the stats

nucleotideStatisticsFromFasta.pl [options]

Options:

 -infile    Provide the fasta sequence file name to complete the stats on
 -outfile   Provide a output file to put the stats into
 -help      Show this message

http://155.33.203.128/teaching/BIOL6200-Spring2017/local/assignments/data_fasta/influenza.fasta.gz
You must implement the following subroutines (name them exactly as
instructed, provide the same arguments, and call them in the same context
as instructed; failure to do so will result in loss of points):

Write a subroutine (call it getFh, called in scalar context) that receives
two arguments: 1) how to open the file, '>' or '<'; 2) a file name. The
purpose of this subroutine is to open the file name passed in, and it should
handle both reading and writing. It should test that either '<' or '>' has
been passed in and die if not. It should also test that the parameter sent
in really is a file when it is an infile call. I can call this subroutine
like this:

my $fhIn = getFh('<', $file2open);
or
my $fhOut = getFh('>', $file2write2);

Write a subroutine (call it getHeaderAndSequenceArrayRefs, called in list
context) that receives one argument: 1) a filehandle to the fasta file used
in this program. The subroutine will return two array references: one for
the headers in the file and one for the sequences in the file. There should
be a one-to-one correspondence between the two, meaning element 1 of the
header array should correspond to element 1 of the sequence array. If
implemented correctly, you can call this subroutine like this:

# send off the filehandle
# get back data in array references
my ($refArrHeaders, $refArrSeqs) = getHeaderAndSequenceArrayRefs($fhIn);

This subroutine should die if it could not successfully get two arrays
of equal size (see _checkSizeOfArrayRefs). Here _checkSizeOfArrayRefs does
the dying, but that gets carried through getHeaderAndSequenceArrayRefs,
which can then be tested in a unit test like:

# should die on this bad file
dies_ok { getHeaderAndSequenceArrayRefs($fhInBad) } 'Dies ok when the bad file is sent in';

Write a subroutine (call it _checkSizeOfArrayRefs, called in void context;
note the _ in front of the name) that receives two arguments: 1) a reference
to the header array built in getHeaderAndSequenceArrayRefs (directly above);
2) a reference to the sequence array built there. This is a helper function
that will be called in the getHeaderAndSequenceArrayRefs subroutine (which
is why its name starts with a _, since it's not really meant to be called in
the main part of your program; we'll discuss this more at a later time). If
the sizes of the arrays passed into this subroutine are not the same, it
should die (telling the user why it died); otherwise it just returns
(return;).

# check to make sure data looks good
_checkSizeOfArrayRefs(\@header, \@seq);
r
Write a subroutine (call it printSequenceStats, called in void context)
that receives three arguments: 1) a reference to the header array returned
by getHeaderAndSequenceArrayRefs (above); 2) a reference to the sequence
array returned by it; 3) the output filehandle the stats will go to. This is
the main subroutine of this program, since it prints the top line of the
output (see below) and each sequence's numerical values. It calls two helper
functions (_getAccession and _getNtOccurrence, see below) for each sequence
before printing that sequence's data. I can call this subroutine like this:

# send off the array references
# process the sequences and print out the data
printSequenceStats($refArrHeader, $refArrSeqs, $fhOut);
v
Write a subroutine (call it _getNtOccurrence, called in scalar context)
that receives two arguments: 1) the character to find the occurrence of in
the DNA sequence; 2) a reference to the sequence data. I can call this
subroutine like this:

my $numAs = _getNtOccurrence('A', $seq);
my $numCs = _getNtOccurrence('C', $seq);

This subroutine should die if it comes across any NT not in [AGCTN].

I could test this like:

dies_ok { _getNtOccurrence('Z', $refArrSeqs->[0]) } 'dies on unknown Z character';
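
A minimal sketch of _getNtOccurrence follows. It assumes that the second argument is a reference to a scalar holding the sequence, and that "any NT not in [AGCTN]" refers to the character being asked for. Both are interpretations of the text above, so adjust them if the course solution differs.

```perl
use strict;
use warnings;

# Count how often $nt occurs in the sequence that $refSeq points to.
# ASSUMPTION: $refSeq is a scalar reference, and only the requested
# character is validated against [AGCTN].
sub _getNtOccurrence {
    my ($nt, $refSeq) = @_;
    die "Unknown nucleotide: $nt\n" unless $nt =~ /^[AGCTN]$/;
    # a global match in list context returns every hit
    my @hits = $$refSeq =~ /\Q$nt\E/g;
    return scalar @hits;
}

my $seq = 'AACGTN';
print _getNtOccurrence('A', \$seq), "\n";
```

Passing a reference avoids copying long sequences on every call, which is presumably why the specification asks for one.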
Write a subroutine (call it _getAccession, called in scalar context) that
receives one argument: 1) a scalar that is the header of the sequence. It
returns the accession number. I can call this subroutine like this:

my $accession = _getAccession($header);
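
A possible sketch of _getAccession. The header layout of influenza.fasta is not shown in this document, so the sketch assumes the accession is the first whitespace-delimited token after the ">". Check that against the real file before relying on it.

```perl
use strict;
use warnings;

# Return the accession from a FASTA header line.
# ASSUMPTION: the accession is the first token after '>'; the actual
# header layout of influenza.fasta is not given in the assignment text.
sub _getAccession {
    my ($header) = @_;
    if ($header =~ /^>\s*(\S+)/) {
        return $1;
    }
    die "Could not find an accession in header: $header\n";
}
```

Dying on a header without a recognizable accession is a design choice of this sketch; the specification does not say what should happen in that case.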

Put the output into a tab-delimited file (influenzaStats.txt), which looks
like this (GC% to one decimal place):

Number  Accession  A's  G's  C's  T's  N's  Length  GC%
1       EU521893   20   20   20   20   0    80      50.0

The numbers above are not accurate, and the Number is just incremented
with each new sequence. The header line in your output file should look like
the one above. Also, the white space above represents tabs; this way you can
easily open the file in Excel!
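
To make the row layout concrete, here is a small sketch that builds one tab-delimited line from the counts. The helper name formatStatsRow is hypothetical, and the sketch assumes that Length is the sum of all five counts and that GC% is computed against that full length.

```perl
use strict;
use warnings;

# Hypothetical helper (not required by the assignment): build one
# tab-delimited row from the nucleotide counts.
# ASSUMPTION: Length = As + Gs + Cs + Ts + Ns, GC% = 100 * (G + C) / Length.
sub formatStatsRow {
    my ($number, $accession, $numAs, $numGs, $numCs, $numTs, $numNs) = @_;
    my $length = $numAs + $numGs + $numCs + $numTs + $numNs;
    # guard against an empty sequence to avoid dividing by zero
    my $gc = $length ? 100 * ($numGs + $numCs) / $length : 0;
    return join("\t", $number, $accession, $numAs, $numGs, $numCs, $numTs,
                $numNs, $length, sprintf('%.1f', $gc)) . "\n";
}

print formatStatsRow(1, 'EU521893', 20, 20, 20, 20, 0);
```

With the sample counts from the table above this yields a length of 80 and a GC% of 50.0, matching the example row.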


If you implemented this correctly, you should end up with a very short main
(4-5 lines of code). This does not include the command line options,
checking the options, closing the filehandles, or comments.
