Computational-Biology Lab 4-Solution


Translation

Write all your code in `task1.sh` (for task 1) and `answers.py` (for the other tasks). Run the script with `python3 answers.py` to make sure that your code works.

Task 1

*This task should be performed in a Linux environment.* Use Docker, Ubuntu on Windows, or `ephesus`, the server that some of you have access to.

Write your answers in `task1.sh`.

Clone your repository onto your machine (Docker or the `ephesus` server).

Go to the relevant directory with `cd YOUR_REPO_NAME/_lab/4`

Download files

Below are two files in `FASTA` format for _Escherichia coli_, the most famous model bacterium. Download them using `wget`.

Genes:

```
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz
```

Proteins:

```
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
```

> Extract the files.

> Count the number of entries (headers) in each file using `grep` (Hint: consider what is common in the header). Feel free to get help from the `lab-1 key`.

Note that these files are in `FASTA` format. This format includes sequences with a header line, which starts with `>` followed by one or multiple sequence lines.

> Calculate the difference between the total number of sequences in the two files. Print the difference using `echo`.

```
# "^" matches every line, so these count all lines; adapt the pattern to count only headers
VAR_A=$(grep -c "^" somefile1.txt)
VAR_B=$(grep -c "^" somefile2.txt)
echo Difference: $(( $VAR_A - $VAR_B ))
```

> Not all the genes are coding (CDS). Find out how many of the genes are CDS. Hint: Coding sequences are annotated with `gbkey=CDS`. Is the number you found equal to the protein count?

> The difference between the two numbers could be due to pseudogenes, which are also annotated as CDS but do not code for a meaningful protein. As you would expect, pseudogenes are not translated, so no corresponding protein sequence can be found in the `faa` file. Pseudogenes are annotated in the header with `pseudo=true`. Count how many pseudogenes there are in the `fna` file. Does it match the difference you observed between the `faa` and `fna` files?

> Add `task1.sh` with `git add task1.sh`

> Commit your work with a message, e.g. “Task 1 is completed”

Task 2

Create a function named `fastareader` that takes the name of the file to read as input and returns a dictionary where keys are headers and values are sequences.

Notes:

* Headers shouldn’t include the `>` at the beginning

* Each line will have a new line character `\n` at the end. Make sure to remove them with `strip()`
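
One way the notes above could be put together is sketched below. This is a minimal sketch, not the reference solution: it assumes a sequence may span several lines and that blank lines can be skipped, and the file `toy.fna` with its two records is made up for the demonstration.

```python
import os
import tempfile


def fastareader(filename):
    """Read a FASTA file into a dict mapping header (without '>') to sequence."""
    sequences = {}
    header = None
    with open(filename) as handle:
        for line in handle:
            line = line.strip()  # drop the trailing newline
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]  # remove the leading '>'
                sequences[header] = []
            elif header is not None:
                sequences[header].append(line)
    # Join the collected lines of each record into one sequence string
    return {h: "".join(parts) for h, parts in sequences.items()}


# Usage on a tiny, made-up FASTA file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.fna")
    with open(path, "w") as out:
        out.write(">gene1 [gbkey=CDS]\nATGAAA\nTAA\n>gene2\nGTGCCC\n")
    toy = fastareader(path)

print(toy)
```

Collecting the lines of each record in a list and joining them once at the end avoids repeated string concatenation on long sequences.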

> Add `answers.py` with `git add answers.py`

> Commit your work with message “Task 2 is completed”

Task 3

Create a function named `translate` that takes a DNA sequence (`DNA`) and a codon table (`codon_table`, a dictionary) as inputs and returns the corresponding protein sequence.

```
def translate(DNA, codon_table):
    protein = ''
    # Your code here
    return protein
```
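
A possible way to fill in the template is sketched here. It rests on assumptions that the lab text does not confirm: that the codon table maps three-letter DNA codons to one-letter amino acids, that stop codons map to `*`, and that translation should end at the first stop codon. The partial table `toy_table` is hypothetical and exists only to make the example runnable.

```python
# Hypothetical, partial codon table for illustration only
toy_table = {
    "ATG": "M", "AAA": "K", "GTG": "V", "CCC": "P",
    "TAA": "*", "TAG": "*", "TGA": "*",
}


def translate(DNA, codon_table):
    protein = ""
    # Walk the sequence codon by codon, ignoring a trailing partial codon
    for i in range(0, len(DNA) - len(DNA) % 3, 3):
        amino_acid = codon_table[DNA[i:i + 3].upper()]
        if amino_acid == "*":  # assumed stop symbol
            break
        protein += amino_acid
    return protein


print(translate("ATGAAACCCTAA", toy_table))  # MKP
```

If the codon table used in the course encodes stop codons differently, the stop check needs to be adapted accordingly.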

> Add `answers.py` with `git add answers.py`

> Commit your work with message “Task 3 is completed”

Task 4

Get sequence dictionaries from two files `faa` and `fna`.

```
proteinDict = fastareader('GCF_000005845.2_ASM584v2_protein.faa')
DNADict = fastareader('GCF_000005845.2_ASM584v2_cds_from_genomic.fna')
```

Save the protein sequences in a list with `proteinSequenceList = list(proteinDict.values())`

Write a loop that iterates through each header and sequence of `DNADict`. Then, translate the DNA sequence using the `translate` function that you created in `Task 3`. Within the loop, check whether your translated protein is actually found in the protein sequence list (`proteinSequenceList`). If a sequence is found in the proteins file, (i) increment the `geneExist` variable and (ii) write the header and sequence into the `proteins_found.faa` file in `FASTA` format. If the sequence is not found, write it into `proteins_not_found.faa` in `FASTA` format and increment `geneDoesntExist`.

Example:

```
import genetic_code

for header in DNADict.keys():
    if thisIsProteinCoding == True:  # Change the if statement here
        # Hint: if it is CDS and not a pseudogene
        DNAsequence = DNADict[header]
        proteinSequence = translate(DNAsequence, genetic_code.universal)
        # More code here
```
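
The full comparison loop could look roughly like the sketch below. It is self-contained, so it uses made-up stand-ins for `DNADict` and `proteinDict`, a hypothetical partial codon table `toy_table` in place of `genetic_code.universal`, and a hypothetical helper `is_protein_coding` that assumes the headers carry bracketed tags such as `[gbkey=CDS]` and `[pseudo=true]`. Output files are written to a temporary directory here rather than to the repository.

```python
import os
import tempfile

# Hypothetical, partial codon table standing in for genetic_code.universal
toy_table = {"ATG": "M", "AAA": "K", "GTG": "V", "CCC": "P", "TAA": "*"}


def translate(DNA, codon_table):
    protein = ""
    for i in range(0, len(DNA) - len(DNA) % 3, 3):
        amino_acid = codon_table[DNA[i:i + 3]]
        if amino_acid == "*":  # assumed stop symbol
            break
        protein += amino_acid
    return protein


def is_protein_coding(header):
    # Assumes bracketed header tags such as [gbkey=CDS] and [pseudo=true]
    return "[gbkey=CDS]" in header and "[pseudo=true]" not in header


# Made-up stand-ins for DNADict and proteinDict
DNADict = {
    "g1 [gbkey=CDS]": "ATGAAATAA",
    "g2 [gbkey=CDS]": "GTGCCCTAA",
    "g3 [pseudo=true] [gbkey=CDS]": "ATGCCCTAA",
}
proteinDict = {"p1": "MK", "p2": "MP"}
proteinSet = set(proteinDict.values())  # a set gives fast membership tests

geneExist = 0
geneDoesntExist = 0
outdir = tempfile.mkdtemp()
with open(os.path.join(outdir, "proteins_found.faa"), "w") as found, \
        open(os.path.join(outdir, "proteins_not_found.faa"), "w") as not_found:
    for header, DNAsequence in DNADict.items():
        if not is_protein_coding(header):
            continue  # skip pseudogenes and non-CDS entries
        proteinSequence = translate(DNAsequence, toy_table)
        if proteinSequence in proteinSet:
            geneExist += 1
            found.write(f">{header}\n{proteinSequence}\n")
        else:
            geneDoesntExist += 1
            not_found.write(f">{header}\n{proteinSequence}\n")

print(geneExist, geneDoesntExist)  # 1 1
```

The sketch checks membership against a set instead of the list named in the task, because looking up several thousand translated sequences in a list rescans the list every time. A list gives the same counts, only more slowly.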

> How many of the sequences that you translated matched a protein from the `faa` file? How many of them were not in the `faa` file? Report the numbers and show your work.

> Add the necessary files with `git add answers.py proteins_found.faa proteins_not_found.faa`

> Commit your work with message “Task 4 is completed”

Task 5

When you investigate the proteins that you were not able to find in the `faa` file, you will see that they do not have `M` as the first amino acid. This means that some genes do not have `AUG` as the start codon. There are other start codons as well; however, as start codons they always encode `M`.

> Write a new translate function named `new_translate` that forces `M` to be the first amino acid no matter what the first codon is.

> Perform a comparison similar to the one in `Task 4` to see whether all proteins that you translated with `new_translate` can be found in the `faa` file. Write the sequences that are not found exactly in the proteins list to a file named `proteins_not_found_afterForcedMethionine.faa`. Increment the `geneExist` variable if a sequence is found in the `faa` file; otherwise increment the `geneDoesntExist` variable.

> Read the headers of the sequences that are still not found and notice the `transl_except` attribute. Look it up and work out why your new translation function doesn’t yield exactly the expected protein sequences. Write your answer as a comment in `answers.py`.

> Add necessary files with `git add answers.py proteins_not_found_afterForcedMethionine.faa`

> Commit your work with message “Task 5 is completed”
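
For the forced-methionine translation in Task 5, a minimal sketch might look as follows. As before, the partial codon table `toy_table` is hypothetical, and the stop symbol `*` and the stop-at-first-stop behaviour are assumptions rather than requirements stated in the lab.

```python
# Hypothetical, partial codon table for illustration only
toy_table = {"ATG": "M", "AAA": "K", "GTG": "V", "CCC": "P", "TAA": "*"}


def new_translate(DNA, codon_table):
    if len(DNA) < 3:
        return ""  # no complete start codon to read
    protein = "M"  # the first codon is always read as methionine
    # Continue from the second codon, ignoring a trailing partial codon
    for i in range(3, len(DNA) - len(DNA) % 3, 3):
        amino_acid = codon_table[DNA[i:i + 3]]
        if amino_acid == "*":  # assumed stop symbol
            break
        protein += amino_acid
    return protein


print(new_translate("GTGCCCTAA", toy_table))  # MP
```

With a plain codon-by-codon translation the same input would start with `V`, which is why such sequences fail the exact match in Task 4.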

Task 6

> Report the percentages of start codons for protein-coding genes. Write them into a file `start_codons.tsv` in a table format. Use Python. Show your work.

Expected outcome:

```
ATG 90.36%
ATT 0.09%
CTG 0.05%
TTG 1.79%
GTG 7.71%
```
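
The counting step could be sketched as below, using `collections.Counter` on the first three bases of each sequence. The four sequences are made up and stand in for the protein-coding entries of `DNADict`, so the percentages will not match the expected outcome above. The file is written to a temporary directory for the demonstration.

```python
import csv
import os
import tempfile
from collections import Counter

# Made-up coding sequences standing in for the filtered DNADict values
sequences = ["ATGAAATAA", "ATGCCCTAA", "GTGCCCTAA", "ATGAAACCC"]

# Count how often each start codon (first three bases) occurs
counts = Counter(seq[:3] for seq in sequences)
total = sum(counts.values())
percentages = {codon: 100 * n / total for codon, n in counts.items()}

# Write one tab-separated row per start codon
path = os.path.join(tempfile.mkdtemp(), "start_codons.tsv")
with open(path, "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for codon, pct in sorted(percentages.items()):
        writer.writerow([codon, f"{pct:.2f}%"])

with open(path) as handle:
    print(handle.read())
```

On the real data, the sequences passed to the counter should first be restricted to protein-coding genes, for example with the same CDS and pseudogene check used in Task 4.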

> Add your files with `git add answers.py start_codons.tsv`

> Commit your work with message “Task 6 is completed”

Push your work to GitHub!

```
git push
```
