BIOC334/434 homework Bioinformatics (by Focco van den Akker, 2024)
This homework problem is actual original research and could lead to interesting discoveries. You’ll be trying to assign possible biological functions to proteins that have been labeled/annotated as having an ‘Unknown Function’.
There are two proteins you’ll be researching. In the first part, you’ll select a protein with unknown function for which the structure has already been determined. In the second part, the protein structure has likely not been determined; only its sequence is known. You need to print out and hand in a 6-10 page PowerPoint document with image snapshots of the names of the proteins, their amino acid sequences, and all the key results you get from the bioinformatics websites, and write a few sentences describing what each finding means. Then, for each of the two proteins you studied, add a paragraph at the end summarizing what you have found out about it (i.e. likely structure, likely function, possible interaction partners, etc.). Feel free to cut-and-paste multiple snapshot outputs on a single page to keep it within the page limits. Once finished and the PowerPoint file is saved, save it also as a PDF file (filename lastname_firstname_BIOCx34_HW.pdf) and email that .pdf file to me at [email protected]. It is due Wednesday November 20th at 11:59 pm. Please start early, as some of the bioinformatics tools/servers can take a day before you get the results emailed back.
1. Research possible functions for a protein with Unknown Function for which a structure is known:
Go to the PDB: www.rcsb.org
Click on ‘Advanced Search’ (link just under the regular search box at the top). In the Advanced Search Query Builder section, and then in the Structure Attribute section, click on the double down arrow to select ‘Structure Title’. In the search box to the right, which searches for ‘contains phrase’, enter ‘Unknown function’ and then hit the Enter key. This should yield around 340 hits. Now select a structure as follows (so you don’t all pick the first one listed): take the day of the month you were born on (e.g. the 15th of March) and divide it by 2 (15/2 = 7.5). Round that number down to a whole number (if the number is 15, choose 14, as there are only 14 pages). Go to that page of hits by clicking the arrows at the bottom. On that page, select the xth hit from the top, where x is your birth month (e.g. March = 3). After you have chosen the PDB target to analyze, select it by clicking on either the 4-character PDB identifier/code or the title of the entry. Please keep track of the PDB identifier (it contains both letters and numbers), as you need it as input for subsequent bioinformatics tools. Download the PDB file of the coordinates onto your computer: click on ‘Download Files’ and select ‘PDB format’. This will download a coordinate file with the extension .pdb.
Now also download the amino acid sequence file: Click again on ‘Download Files’ and select ‘FASTA Sequence’ . This will download a sequence file with the extension .fasta.txt
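The selection rule above is simple integer arithmetic, and a minimal Python sketch of it is given here so you can check your own numbers. The function name pick_hit and the max_page parameter are illustrative only. The cap of 14 pages follows from the roughly 340 hits mentioned above and may differ on the live site. Mapping a birth day of 1 to page 1 is an assumption, since the rule as written would give page 0.

```python
from typing import Tuple


def pick_hit(birth_day: int, birth_month: int, max_page: int = 14) -> Tuple[int, int]:
    """Return (results page, hit position on that page) for the selection rule."""
    if not 1 <= birth_day <= 31 or not 1 <= birth_month <= 12:
        raise ValueError("birth_day must be 1-31 and birth_month 1-12")
    page = birth_day // 2       # divide the day by 2 and round down
    page = min(page, max_page)  # e.g. 15 becomes 14 when there are only 14 pages
    page = max(page, 1)         # assumption: day 1 maps to page 1, not page 0
    return page, birth_month    # hit position from the top = birth month (March = 3)


if __name__ == "__main__":
    # Someone born on the 15th of March goes to page 7 and picks the 3rd hit.
    print(pick_hit(15, 3))
```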
Do the following steps (please label the steps ‘1’, ‘2’, etc. in your homework so I can tell what you completed):
1. First, let’s check whether the model matches the experimentally determined electron density well, so you can be confident of the structure. Go to the www.rcsb.org page of that PDB structure using its PDB ID. Under the picture on the left showing the structure, click on ‘electron density’. The structure should show up in a 3D view. Click on a region in an alpha helix or beta-strand with a single left mouse button click and the density will show up (2Fo-Fc shows all atoms; Fo-Fc(+ve) is positive difference electron density, where they should have added some atom(s); Fo-Fc(-ve) is negative difference electron density, where they should not have placed an atom or atoms). Just click on various regions in the protein (especially a ligand, if present) and you will find that the inner core has better-defined 2Fo-Fc density and that surface regions likely have poorer density (more flexible/disordered). Snap an image of a good region of electron density, include it in your homework, and state in 1 sentence how confident you are that the crystal structure was well built and refined. If you can find a definite error in the structure and show it in the figure, that would be very impressive, but don’t worry if you cannot find an error. By the way, if you have chosen an NMR structure, there will of course be no electron density to calculate; in that case, click to the right on wwPDB Validation -> 3D Report and snap an image of some relevant details regarding the validity of this NMR structure.
2. Use the Dali Server to find similar structures (input the protein coordinate .pdb file). Note that you should add your email address, as this server can take up to a day to email you the results. Similarity to another protein could yield clues to the structure of your protein and suggest that it has a similar function to that protein:
http://ekhidna2.biocenter.helsinki.fi/dali/
3. Predict ligand binding sites on your protein using the PrankWeb server:
https://prankweb.cz/
4. Sometimes inside the crystal lattice (if you picked a crystal structure, which is quite likely) some molecules form important dimer, trimer, or other types of oligomeric arrangements. Please check that out using the following servers:
http://eppic-web.org/ewui/
http://www.ebi.ac.uk/pdbe/pisa/
5. Now go to the BLAST protein website to find database protein matches, using only the protein amino acid sequence to search with:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
This BLAST search might yield some clues about function if you detect weak homologs. Paste in the protein amino acid sequence that you downloaded from the PDB (the FASTA sequence file). Search the ‘Non-redundant protein sequences (nr)’ database first, keeping the Algorithm as ‘blastp (protein-protein BLAST)’.
Feel free to also try the PSI-BLAST, PHI-BLAST, and DELTA-BLAST algorithms (depending on whether or not you obtained any useful hits).
6. Do a STRING search to find proteins that are predicted to interact with your protein (could also yield clues to the function):
http://string-db.org/#
7. Paragraph summarizing the above results from points 1.1-1.6
2. Research possible functions for a protein with Unknown Function for which no structure is known:
1. First, let’s make sure students are not all selecting and studying the same protein: let’s find an organism, based on your name, for which you’ll be studying a protein with Unknown Function. Go to:
http://string-db.org/
Click Search, then click on the downward-pointing triangle to the right of the box labeled “organism:” and, by scrolling through that list, select an organism whose name starts with the first letter of either your first name or your last name.
Then, go to the NCBI site:
https://www.ncbi.nlm.nih.gov/
and change the selection in the box in the upper left corner from ‘All Databases’ to ‘Protein’. For example, for someone whose name starts with an ‘A’, I selected the organism group acetobacter. Then enter into the search box: "Unknown function" [Title] acetobacter (enter the query exactly as shown, including the quotation marks around Unknown function, but do change the organism name, of course). The resulting list has the names as well as the size in amino acids listed (### aa protein). Select a protein from this list that is at least 200 aa (= amino acids) long, and click on the ‘FASTA’ link under the hit to download the amino acid sequence.
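To double-check that the downloaded sequence meets the 200 aa minimum, a short Python sketch such as the one below can count the residues in the FASTA file. The function names read_fasta and is_long_enough are illustrative, the example record is made up, and the parser assumes the file holds a single sequence record.

```python
from typing import Tuple

MIN_LENGTH = 200  # minimum protein size required above, in amino acids


def read_fasta(text: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]  # drop the leading '>'
    sequence = "".join(lines[1:]).replace(" ", "").upper()
    return header, sequence


def is_long_enough(sequence: str, min_length: int = MIN_LENGTH) -> bool:
    """True if the sequence has at least min_length residues."""
    return len(sequence) >= min_length


if __name__ == "__main__":
    # Made-up record for illustration only, not a real database entry.
    example = ">hypothetical_protein_1 illustrative record\nMKT" + "A" * 200 + "\n"
    header, seq = read_fasta(example)
    print(header, len(seq), is_long_enough(seq))
```

To use it on your own download, read the file contents with open(path).read() and pass the text to read_fasta.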
2. Now go to the BLAST protein website to find database protein matches for which there are structures determined:
http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
This BLAST search might yield some clues about similar structure/function if you detect weak homologs. Paste in the protein amino acid sequence that you downloaded above (the FASTA sequence file). Change the Database to be searched to ‘Protein Data Bank proteins (pdb)’ so you are searching sequences for which a structure has been determined. Keep the Algorithm as ‘blastp (protein-protein BLAST)’. Check out the results.
3. Run PSIPRED on that amino acid sequence to predict what fold and function your protein has (select as many algorithms as possible by checking the boxes):
http://bioinf.cs.ucl.ac.uk/psipred/
4. Do the same using HHpred:
https://toolkit.tuebingen.mpg.de/#/tools/hhpred
5. As well as using InterProScan:
http://www.ebi.ac.uk/interpro/search/sequence-search
6. Do a STRING search to find proteins that are predicted to interact with your protein:
http://string-db.org/#
7. Build a homology model of your protein using SWISS-MODEL, check the AlphaFold database of modeled protein structures, or run AlphaFold yourself (not as easy to navigate):
http://swissmodel.expasy.org/
or
already modeled AlphaFold models in their database:
https://alphafold.ebi.ac.uk/
or
a de novo AlphaFold 3 prediction :
https://alphafoldserver.com/about
8. Paragraph summarizing the above results from points 2.1-2.7
From any of the above results, there are likely some clues about structural similarity to your protein via the modeled results. Download the coordinates of such a hit either directly from the bioinformatics website or from the PDB. Then view the coordinates using a protein structure viewer, either one embedded in the results page of the website or a viewer like PyMOL, Web3DMol, POLYVIEW-3D, or COOT (just google the websites; if your protein did not yield any structural similarity hits, use the protein you selected in Part 1 of this homework).
Some of these are web browser-based viewers so you don’t need to install any program if that’s what you prefer. Inspect the structure using the viewer you selected and also test some of the viewer options. Take a few snapshots of this protein structure to include in your homework printout.
3. Only for BIOC 434 students: from the output of the BLASTP search in Section 1 (step 5) above, select 4 sequences of varying sequence identity and save those 4 as FASTA sequence files (so not all ~99.9% sequence identity, but preferably much lower). Then make a multiple sequence alignment using Clustal Omega with your sequence and the 4 you obtained from the BLASTP search. Remember that you should add ‘>name’ above each sequence:
https://www.ebi.ac.uk/Tools/msa/clustalo/
(see my lecture notes as well on this topic and input example).
Only show the phylogenetic tree and the sequence identity table in your output (see my lecture notes). Attach the example homework at the end of this PDF.
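As a rough illustration of the ‘>name’ input format for the alignment, the Python sketch below combines several sequences into one multi-FASTA text that can be pasted into the Clustal Omega input box. The function name to_multi_fasta, the 60-character line width, and the short placeholder sequences are assumptions made for illustration. The lecture notes remain the reference for the expected input.

```python
from typing import Dict


def to_multi_fasta(sequences: Dict[str, str], width: int = 60) -> str:
    """Format {name: sequence} as multi-FASTA text, one '>name' line per record."""
    records = []
    for name, seq in sequences.items():
        seq = "".join(seq.split()).upper()  # drop spaces and line breaks
        if not seq:
            raise ValueError(f"empty sequence for {name!r}")
        # Wrap the sequence into lines of at most `width` residues.
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f">{name}\n{wrapped}")
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    # Placeholder names and sequences, not real BLAST hits.
    demo = {
        "my_protein": "MKTAYIAKQRQISFVKSHFSRQ",
        "hit_1": "MKTAYLAKQRQLSFVKSHFTRQ",
    }
    print(to_multi_fasta(demo))
```

Using short names without spaces keeps the labels in the resulting tree and identity table easy to read.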
Remember, for each of the outputs from everything you’ve done above, to take 1 or more image snapshots that highlight the key results and to describe in a few sentences, right next to each snapshot, what it means. Also, don’t forget to include final summarizing paragraphs for each of the proteins you studied to describe your bioinformatics findings regarding possible structure, function, etc. In your final paragraph, you could also include suggestions for experiments to test your function/structure predictions.
Also, to be successful, you need to demonstrate that you can figure out how to understand the output of the various bioinformatics websites on your own by finding and reading the article that describes the bioinformatics tool you are using.
Note that I have 1 example homework included on Canvas but that is an old example and does not have all of the items (7 for section 1, and 7 for section 2, and Section 3 for only BIOC 434 students).