[Solved] CSE231-Project 6- extract gene lengths from a gene annotation file


Project 06

Assignment Overview

This assignment will give you more experience with:

  1. Lists and tuples
  2. Functions
  3. File manipulation

The goal of this project is to extract gene lengths from a gene annotation file. With a gene annotation GFF file, you will need to extract the gene coordinates on each chromosome and calculate the average and standard deviation of gene lengths.

Assignment Background

The eukaryotic genome is composed of multiple chromosomes, and each chromosome carries multiple genes. In bioinformatics, genome annotations can be saved in a file format called GFF. The NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/) hosts many publicly available annotated organisms. These annotated genomes can be downloaded in multiple file formats, including GFF. For this project, we will focus on a relatively simple model species: Caenorhabditis elegans. This worm has a genome of six chromosomes named chrI, chrII, chrIII, chrIV, chrV, and chrX.

We provide two input files:

C.elegans_small.gff # a small file for development

C.elegans.gff # a real BIG data file

Project Description

  1. open_file() prompts the user to enter a filename. The program will try to open a tab-separated GFF file (a text file). An error message should be shown if the file cannot be opened. This function loops until it receives proper input and successfully opens the file. It returns a file pointer.
  2. read_file(fp) receives a file pointer to the data file and reads all of the gene information. For this project, we are only interested in the following columns: the chromosome name (string) is in column 0, the gene_start is in column 3, and the gene_end is in column 4. Convert number strings to int. No other values are needed for this project. If a value is missing, use 0 as the value.

For each gene, save it in a tuple, (chromosome, gene_start, gene_end), and append each tuple to a list of genes. Sort the list and then return the sorted list of genes (sorting makes a canonical list for comparison testing on Mimir).
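The two functions above might be sketched as follows. This is only one possible approach, not the reference solution: the prompt and error strings are taken from the test cases, while catching OSError and treating an empty or non-numeric field as a "missing" value are assumptions.

```python
def open_file():
    """Prompt until a file opens successfully; return the file pointer."""
    while True:
        filename = input("Input a file name: ")
        try:
            return open(filename, "r")
        except OSError:  # assumption: any OS-level failure counts as "cannot open"
            print("Unable to open file.")


def read_file(fp):
    """Return a sorted list of (chromosome, gene_start, gene_end) tuples."""
    genes_list = []
    for line in fp:
        # Skip annotation lines (starting with '#') and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chromosome = fields[0]
        # Assumption: a missing value appears as an empty or non-numeric field.
        gene_start = int(fields[3]) if len(fields) > 3 and fields[3].isdigit() else 0
        gene_end = int(fields[4]) if len(fields) > 4 and fields[4].isdigit() else 0
        genes_list.append((chromosome, gene_start, gene_end))
    genes_list.sort()  # canonical order for comparison testing
    return genes_list
```

Because read_file() only iterates over fp, it can be exercised with any iterable of lines (for example an io.StringIO object) without touching the disk.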

  3. extract_chromosome(genes_list, chromosome) receives a list of genes (such as what was returned by the read_file() function) and a chromosome name. It extracts the gene information for this chromosome and saves it in the list chrom_gene_list. Sort and return the list (sorting makes a canonical list for comparison testing on Mimir).
  4. extract_genome(genes_list) receives a list of genes and extracts the gene information for each chromosome. In this function, use extract_chromosome(genes_list, chromosome) to extract the genes for each chromosome (chrom_gene_list), then save the returned result in a list genome_list (a list of six chrom_gene_lists). Sort and return the list (sorting makes a canonical list for comparison testing on Mimir).
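As a rough illustration of the two extraction functions, assuming genes are the (chromosome, gene_start, gene_end) tuples described earlier and that chromosome names are compared exactly as stored (lower case in the test data):

```python
# Chromosome names, lower-cased as in the test data.
CHROMOSOMES = ['chri', 'chrii', 'chriii', 'chriv', 'chrv', 'chrx']


def extract_chromosome(genes_list, chromosome):
    """Return the sorted genes whose chromosome name matches exactly."""
    chrom_gene_list = [gene for gene in genes_list if gene[0] == chromosome]
    chrom_gene_list.sort()
    return chrom_gene_list


def extract_genome(genes_list):
    """Return a sorted list holding one chrom_gene_list per chromosome."""
    genome_list = [extract_chromosome(genes_list, chrom) for chrom in CHROMOSOMES]
    genome_list.sort()
    return genome_list
```

Sorting genome_list compares the inner lists element by element, so the result is ordered by the first gene tuple of each chromosome.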

  5. compute_gene_length(chrom_gene_list) This function receives chrom_gene_list, the list of genes for a specific chromosome (such as returned from the extract_chromosome() function). For each gene, compute the gene length and save the result in a list gene_length. Given the gene length list, compute the average gene length and the standard deviation for these genes. Save the results in a tuple with the average gene length first, followed by the standard deviation. Return that tuple.

The length of one gene, gene_len, is calculated as: gene_len = gene_end - gene_start + 1

(Hint: make a list that has the lengths of the genes.)

The gene_mean is the average length of all the genes. (Hint: consider using the sum() and len() functions, something you can do if your lengths are in a list.)

The gene_number is a count of all the genes.

The gene_stddev is the standard deviation of all the gene lengths, calculated according to the following formula:

gene_stddev = sqrt( sum( (gene_len - gene_mean)^2 ) / gene_number )

The summation in the formula sums across all genes. That is, for each gene, subtract the mean from the length and square the difference; then sum those values. Take that sum and divide by the number of genes (gene_number); then take the square root (remember to import math). (Hint: the "for each gene" step can be done easily by walking through a list with a for loop, if your lengths are in a list.)
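The computation described above can be sketched as below. It follows the wording of the description (divide by gene_number, i.e. a population standard deviation) and assumes chrom_gene_list is non-empty; an empty list would raise ZeroDivisionError.

```python
import math


def compute_gene_length(chrom_gene_list):
    """Return (gene_mean, gene_stddev) for a non-empty list of gene tuples."""
    # Length of each gene, counting both end coordinates.
    gene_length = [gene_end - gene_start + 1
                   for _, gene_start, gene_end in chrom_gene_list]
    gene_number = len(gene_length)
    gene_mean = sum(gene_length) / gene_number

    # Sum of squared differences from the mean, one term per gene.
    squared_sum = 0
    for gene_len in gene_length:
        squared_sum += (gene_len - gene_mean) ** 2

    gene_stddev = math.sqrt(squared_sum / gene_number)
    return (gene_mean, gene_stddev)
```

For the three chrii genes in the small test file the lengths are 151, 151 and 2797, which gives a mean of 1033.0.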

  6. display_data(chrom_gene_list, chrom) This function receives chrom_gene_list, the list of genes for a specific chromosome, as well as chrom, the chromosome name as a string. It displays the chromosome name, the average gene length, and the standard deviation (from the compute_gene_length() function, which gets called within this function). The chromosome name must be displayed with the first three characters "chr" in lower case and the remaining characters in upper case, e.g. chrIV. (Hint: slicing is your friend.)
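A possible shape for display_data() is sketched here. The column widths in the format string are guesses, since the exact spacing of the expected output is not recoverable from this page, and the sketch assumes the table header is printed by the caller. A compact compute_gene_length() is repeated only to keep the example self-contained.

```python
import math


def compute_gene_length(chrom_gene_list):
    """Compact helper: return (mean, population std-dev) of gene lengths."""
    lengths = [end - start + 1 for _, start, end in chrom_gene_list]
    mean = sum(lengths) / len(lengths)
    variance = sum((length - mean) ** 2 for length in lengths) / len(lengths)
    return (mean, math.sqrt(variance))


def display_data(chrom_gene_list, chrom):
    """Print one row: chromosome name, mean gene length, standard deviation."""
    gene_mean, gene_stddev = compute_gene_length(chrom_gene_list)
    # Slicing: 'chr' prefix in lower case, the numeral part in upper case.
    name = chrom[:3].lower() + chrom[3:].upper()
    # Assumption: the field widths below are illustrative, not the graded format.
    print("{:<11s}{:9.2f}{:9.2f}".format(name, gene_mean, gene_stddev))
```

Calling display_data() with the three chriv genes from the small file and the name 'chriv' prints a row containing chrIV, 7313.00 and 5053.00.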

Assignment Deliverable

The deliverable for this assignment is the following file:

proj06.py: the source code for your Python program

Be sure to use the specified file name and to submit it for grading via the Mimir system before the project deadline.

Assignment Notes

  1. The gff input files we provide have lines starting with # as file annotations; skip these lines when reading the gff file.
  2. To parse the lines in the gff file, use the split() method. You can split on the tab character.
  3. Use this constant for chromosome names:

CHROMOSOMES = ['chri', 'chrii', 'chriii', 'chriv', 'chrv', 'chrx']

  4. Items 1-9 of the Coding Standard will be enforced for this project.

Test Cases

Function Test 1: read_file

Input file: C.elegans_small.gff

Returns:

[('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585),
('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663),
('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753),
('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899),
('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511),
('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]

Function Test 2: extract_chromosome

genes_list = [('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585),
('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663),
('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753),
('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899),
('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511),
('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]

chromosome = 'chrv'

Returns:

[('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511)]

Function Test 3: extract_genome

genes_list = [('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585),
('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663),
('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753),
('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899),
('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511),
('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]

Returns:

[[('chri', 3747, 3909), ('chri', 4221, 10148), ('chri', 11641, 16585)],
[('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663)],
[('chriii', 1271, 2917), ('chriii', 4251, 11940), ('chriii', 12189, 14753)],
[('chriv', 695, 14926), ('chriv', 8765, 11070), ('chriv', 15499, 20899)],
[('chrv', 180, 329), ('chrv', 180, 329), ('chrv', 2851, 6511)],
[('chrx', 151, 263), ('chrx', 151, 263), ('chrx', 13494, 13643)]]

Function Test 4: compute_gene_length

chrom_gene_list = [('chrii', 25, 175), ('chrii', 25, 175), ('chrii', 1867, 4663)]

Returns:

(1033.0, 1247.33636201307)

Test Case 1

Gene length computation for C. elegans.

Input a file name: C.elegans_small.gff

Enter chromosome or all or quit: chri

Chromosome Length
chromosome    mean    std-dev
chrI       3678.67    2518.14

Enter chromosome or all or quit: chriv

Chromosome Length
chromosome    mean    std-dev
chrIV      7313.00    5053.00

Enter chromosome or all or quit: chrX

Chromosome Length
chromosome    mean    std-dev
chrX        125.33      17.44

Enter chromosome or all or quit: quit

Test Case 2

Gene length computation for C. elegans.

Input a file name: xxx
Unable to open file.

Input a file name: C.elegans_small.gff

Enter chromosome or all or quit: xxx
Error in chromosome. Please try again.

Enter chromosome or all or quit: chrII

Chromosome Length
chromosome    mean    std-dev
chrII      1033.00    1247.34

Enter chromosome or all or quit: CHIII
Error in chromosome. Please try again.

Enter chromosome or all or quit: CHRIII

Chromosome Length
chromosome    mean    std-dev
chrIII     3967.33    2658.87

Enter chromosome or all or quit: aLL

Chromosome Length
chromosome    mean    std-dev
chrI       3678.67    2518.14
chrII      1033.00    1247.34
chrIII     3967.33    2658.87
chrIV      7313.00    5053.00
chrV       1320.33    1655.10
chrX        125.33      17.44

Enter chromosome or all or quit: qUiT
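The transcripts above suggest that the chromosome prompt is matched case-insensitively (qUiT, aLL and chrX are all accepted) and that anything else triggers the error message. One way to sketch that check is shown below; parse_command() is a hypothetical helper introduced here for illustration and is not one of the required functions.

```python
CHROMOSOMES = ['chri', 'chrii', 'chriii', 'chriv', 'chrv', 'chrx']


def parse_command(user_input):
    """Return 'quit', 'all', a lower-case chromosome name, or None if invalid.

    Hypothetical helper: normalizes the reply to the
    'Enter chromosome or all or quit: ' prompt.
    """
    command = user_input.strip().lower()
    if command in ('quit', 'all') or command in CHROMOSOMES:
        return command
    return None  # caller would print "Error in chromosome. Please try again."
```

A main loop could call this helper on each reply, print the error message when it returns None, and stop when it returns 'quit'.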

Test Case 3

Gene length computation for C. elegans.

Input a file name: C.elegans.gff

Enter chromosome or all or quit: all

Chromosome Length
chromosome    mean    std-dev
chrI       2542.65    4104.10
chrII      1879.71    2945.42
chrIII     2469.57    3761.81
chrIV       535.14    1949.55
chrV       1711.47    2687.29
chrX       1575.51    3110.69

Enter chromosome or all or quit: quit

Test Case 4

Blind test.
