Description
Translation
Write all your code in `task1.sh` (for task 1) and `answers.py` (for the other tasks). Run the script with `python3 answers.py` to make sure that your code works.
Task 1
*This task should be performed in a Linux environment.* Use Docker, Ubuntu on Windows, or `ephesus`, the server that some of you have access to.
Write your answers in `task1.sh`.
Clone your repository onto your machine (Docker or the `ephesus` server).
Go to the relevant directory with `cd YOUR_REPO_NAME/_lab/4`
Download files
Below are two files in `FASTA` format for _Escherichia coli_, the most famous model bacterium. Download them using `wget`.
Genes:
```
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz
```
Proteins:
```
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
```
> Extract the files.
> Count the number of entries (headers) in each file using `grep` (Hint: consider what is common in the header). Feel free to get help from the `lab-1 key`.
Note that these files are in `FASTA` format. This format includes sequences with a header line, which starts with `>` followed by one or multiple sequence lines.
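As a quick illustration of the format, the sketch below counts header lines in a small FASTA string with Python. The records and the helper name `count_headers` are made up for illustration; the task itself still asks for `grep`.

```python
# Toy FASTA text; identifiers and sequences are invented for illustration.
toy_fasta = """>seq1 first toy record
ATGAAACGC
ATTTAA
>seq2 second toy record
ATGTGA
"""

def count_headers(text):
    """Count the lines that start with '>' (one per FASTA entry)."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))

print(count_headers(toy_fasta))  # prints 2
```

Each record contributes exactly one header line, so counting headers is the same as counting entries, regardless of how many lines a sequence spans.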
> Calculate the difference between the total number of sequences in the two files. Print the difference using `echo`.
```
# "^" matches every line, so -c counts all lines; adapt the pattern to count headers only
VAR_A=$(grep -c "^" somefile1.txt)
VAR_B=$(grep -c "^" somefile2.txt)
echo Difference: $(( $VAR_A - $VAR_B ))
```
> Not all the genes are coding (CDS). Find out how many of the genes are CDS. Hint: Coding sequences are annotated with `gbkey=CDS`. Is the number you found equal to the protein count?
> The difference between the two numbers could be due to pseudogenes, which are also annotated as CDS but don't code for a meaningful protein. As you would expect, pseudogenes are not translated, and therefore no corresponding protein sequence can be found in the `faa` file. Pseudogenes are annotated in the header with `pseudo=true`. Count how many pseudogenes there are in the `fna` file. Does it match the difference you observed between the `faa` and `fna` files?
> Add `task1.sh` with `git add task1.sh`
> Commit your work with a message, e.g. “Task 1 is completed”
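The header annotations used in Task 1 can also be inspected from Python, which becomes useful again in Task 4. The sketch below assumes the annotations appear as bracketed `[key=value]` pairs such as `[gbkey=CDS]` and `[pseudo=true]`; the headers and the helper name `header_fields` are made up for illustration.

```python
import re

# Made-up headers that imitate bracketed key=value annotations.
toy_headers = [
    "lcl|toy_cds_1 [gene=abcA] [protein_id=TOY_1.1] [gbkey=CDS]",
    "lcl|toy_cds_2 [gene=abcB] [pseudo=true] [gbkey=CDS]",
    "lcl|toy_cds_3 [gene=abcC] [protein_id=TOY_3.1] [gbkey=CDS]",
]

def header_fields(header):
    """Return the [key=value] annotations of a header as a dictionary."""
    return dict(re.findall(r"\[(\w+)=([^\]]*)\]", header))

cds_count = sum(1 for h in toy_headers if header_fields(h).get("gbkey") == "CDS")
pseudo_count = sum(1 for h in toy_headers if header_fields(h).get("pseudo") == "true")
print(cds_count, pseudo_count)  # prints 3 1
```

In this toy set, the number of CDS entries minus the number of pseudogenes (3 - 1 = 2) is the number of entries that would be expected to have a protein record.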
Task 2
Create a function named `fastareader` that takes the name of the file to read as input and returns a dictionary whose keys are headers and whose values are sequences.
Notes:
* The header shouldn't have `>` at the beginning.
* Each line has a newline character `\n` at the end. Make sure to remove it with `strip()`.
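One possible shape for such a reader is sketched below. It is only an illustration under stated assumptions: sequence lines of one record are joined into a single string, blank lines are skipped, and the demo file content is invented.

```python
import os
import tempfile

def fastareader(filename):
    """Read a FASTA file into a dictionary of header -> sequence."""
    sequences = {}
    header = None
    with open(filename) as handle:
        for line in handle:
            line = line.strip()  # removes the trailing '\n'
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]  # drop the leading '>'
                sequences[header] = []
            elif header is not None:
                sequences[header].append(line)
    # Join the collected lines of each record into one string.
    return {h: "".join(parts) for h, parts in sequences.items()}

# Demo on a temporary toy file (content invented for illustration).
with tempfile.NamedTemporaryFile("w", suffix=".fna", delete=False) as tmp:
    tmp.write(">seq1 toy\nATGAAA\nCGCTAA\n>seq2 toy\nATGTGA\n")
demo = fastareader(tmp.name)
os.remove(tmp.name)
print(demo)
```

Collecting the lines in a list and joining once at the end avoids repeated string concatenation for long sequences.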
> Add `answers.py` with `git add answers.py`
> Commit your work with message “Task 2 is completed”
Task 3
Create a function named `translate` that takes a DNA sequence and a codon table (`codon_table`, a dictionary) as inputs and returns the corresponding protein sequence.
```
def translate(DNA, codon_table):
    protein = ''
    # Your code here
    return protein
```
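A minimal sketch of how the template could be filled in is shown below. It assumes the codon table maps three-letter DNA codons to one-letter amino acids and marks stop codons with `*`; the tiny `toy_codon_table` is invented for illustration and is not the table from `genetic_code`.

```python
# A deliberately tiny codon table; a real table has 64 entries.
toy_codon_table = {"ATG": "M", "AAA": "K", "CGC": "R", "GTG": "V", "TAA": "*"}

def translate(DNA, codon_table):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = ""
    # Step through complete codons only (a trailing partial codon is ignored).
    for i in range(0, len(DNA) - len(DNA) % 3, 3):
        amino_acid = codon_table[DNA[i:i + 3]]
        if amino_acid == "*":  # assumption: stop codons map to '*'
            break
        protein += amino_acid
    return protein

print(translate("ATGAAACGCTAA", toy_codon_table))  # prints MKR
```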
> Add `answers.py` with `git add answers.py`
> Commit your work with message “Task 3 is completed”
Task 4
Get sequence dictionaries from the two files, `faa` and `fna`.
```
proteinDict = fastareader('GCF_000005845.2_ASM584v2_protein.faa')
DNADict = fastareader('GCF_000005845.2_ASM584v2_cds_from_genomic.fna')
```
Save the protein sequences in a list with `proteinSequenceList = list(proteinDict.values())`.
Write a loop that iterates through each header and sequence of `DNADict`. Translate each DNA sequence using the `translate` function that you created in `Task 3`. Within the loop, check whether your translated protein is actually found in the protein sequence list (`proteinSequenceList`). If a sequence is found in the proteins file, (i) increment the `geneExist` variable and (ii) write the header and sequence into the `proteins_found.faa` file in `FASTA` format. If the sequence is not found, write it into `proteins_not_found.faa` in `FASTA` format and increment `geneDoesntExist`.
Example:
```
import genetic_code

for header in DNADict.keys():
    if thisIsProteinCoding == True:  # Change the if statement here
        # Hint: if it is CDS and not a pseudogene
        DNAsequence = DNADict[header]
        proteinSequence = translate(DNAsequence, genetic_code.universal)
        # More code here
```
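The bookkeeping part of this task (counting matches and writing the two FASTA files) can be sketched separately from the translation step. The helper name `split_by_match` and the toy data are assumptions for illustration; the demo writes into a temporary directory rather than the repository.

```python
import os
import tempfile

def split_by_match(translated, known_proteins, found_path, not_found_path):
    """Write each translated record to one of two FASTA files and count both groups."""
    known = set(known_proteins)  # a set makes membership checks fast
    geneExist = 0
    geneDoesntExist = 0
    with open(found_path, "w") as found, open(not_found_path, "w") as not_found:
        for header, protein in translated.items():
            if protein in known:
                geneExist += 1
                found.write(f">{header}\n{protein}\n")
            else:
                geneDoesntExist += 1
                not_found.write(f">{header}\n{protein}\n")
    return geneExist, geneDoesntExist

# Toy data: header -> already translated protein, plus a list of known proteins.
toy_translated = {"gene1": "MKR", "gene2": "VKR", "gene3": "MK"}
toy_known = ["MKR", "MK"]
out_dir = tempfile.mkdtemp()
counts = split_by_match(
    toy_translated,
    toy_known,
    os.path.join(out_dir, "proteins_found.faa"),
    os.path.join(out_dir, "proteins_not_found.faa"),
)
print(counts)  # prints (2, 1)
```

Checking membership in a list also works, as the task describes, but it scans the whole list for every gene; converting the list to a set once is a common way to speed this up.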
> How many of the sequences that you translated matched a protein from the `faa` file? How many of them were not in the `faa` file? Report the numbers and show your work.
> Add necessary files with `git add answers.py proteins_found.faa proteins_not_found.faa`
> Commit your work with message “Task 4 is completed”
Task 5
When you investigate the proteins that you weren't able to find in the `faa` file, you will see that they do not have `M` as the first amino acid. This means that some genes do not have `AUG` as the start codon. There are other start codons as well; however, at the start position they are always translated as `M`.
> Write a new translate function named `new_translate` that forces `M` to be the first amino acid no matter what the first codon is.
> Perform a comparison similar to the one in `Task 4` to see whether all proteins that you translated with `new_translate` can be found in the `faa` file. Write the sequences that are not found exactly in the protein list to a file named `proteins_not_found_afterForcedMethionine.faa`. Increment the `geneExist` variable if a sequence is found in the `faa` file; if it isn't, increment the `geneDoesntExist` variable.
> Read the headers of the sequences that are not found and notice the `transl_except` annotation. Google it and understand why your new translation function doesn't yield the expected protein sequences exactly. Write your answer in `answers.py` as a comment.
> Add necessary files with `git add answers.py proteins_not_found_afterForcedMethionine.faa`
> Commit your work with message “Task 5 is completed”
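The forced-methionine translation described in Task 5 could look roughly like the sketch below. As before, the toy codon table and the convention that stop codons map to `*` are assumptions made for illustration.

```python
# A deliberately tiny codon table; a real table has 64 entries.
toy_codon_table = {"ATG": "M", "GTG": "V", "AAA": "K", "CGC": "R", "TAA": "*"}

def new_translate(DNA, codon_table):
    """Translate DNA, always reading the first codon as methionine (M)."""
    protein = "M" if len(DNA) >= 3 else ""
    # Start from the second codon; the first one has been forced to 'M'.
    for i in range(3, len(DNA) - len(DNA) % 3, 3):
        amino_acid = codon_table[DNA[i:i + 3]]
        if amino_acid == "*":  # assumption: stop codons map to '*'
            break
        protein += amino_acid
    return protein

print(new_translate("GTGAAACGCTAA", toy_codon_table))  # prints MKR
```

With a plain codon lookup, the `GTG` start codon in this example would be read as `V`; forcing the first position to `M` is the only difference from the Task 3 function.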
Task 6
> Report the percentages of start codons for protein-coding genes. Write them into a file `start_codons.tsv` in a table format. Use Python. Show your work.
Expected outcome:
```
ATG 90.36%
ATT 0.09%
CTG 0.05%
TTG 1.79%
GTG 7.71%
```
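The percentage calculation can be sketched with `collections.Counter`. The sequences below are toy data, and the helper name `start_codon_percentages` is hypothetical; in the real task, the input would be the protein-coding entries of `DNADict`.

```python
from collections import Counter

# Toy coding sequences, invented for illustration.
toy_cds = {
    "gene1": "ATGAAATAA",
    "gene2": "ATGCGCTAA",
    "gene3": "GTGAAATAA",
    "gene4": "ATGTAA",
}

def start_codon_percentages(sequences):
    """Return the percentage of sequences that begin with each start codon."""
    counts = Counter(seq[:3].upper() for seq in sequences.values())
    total = sum(counts.values())
    return {codon: 100 * n / total for codon, n in counts.items()}

percentages = start_codon_percentages(toy_cds)
# One tab-separated row per codon, formatted with two decimals.
table = "\n".join(f"{codon}\t{pct:.2f}%" for codon, pct in sorted(percentages.items()))
print(table)
```

Writing `table` followed by a newline to `start_codons.tsv` with `open(..., "w")` would produce the tab-separated file the task asks for.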
> Add the necessary files with `git add answers.py start_codons.tsv`
> Commit your work with message “Task 6 is completed”
Push your work to GitHub!
```
git push
```