Description
---
Task 1
=> Download the CDS sequences of _Escherichia coli_ from the following link:
`ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz`
=> Unzip the file.
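If you prefer to unzip from Python rather than the command line, the following is a minimal sketch using only the standard library. The helper name `gunzip` and the local file names in the usage comment are assumptions for illustration, not part of the assignment.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file `src` and write the result to `dst`."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream the data in chunks so large files are not loaded into memory
        shutil.copyfileobj(fin, fout)


# Example call (assumes the archive was already downloaded to this folder):
# gunzip("GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz", "ecoli_cds.fna")
```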
Task 1A (`1 pt`)
=> Write a Python script (`answers.py`) that retrieves the start codon, stop codon, and gene length for each protein-coding gene (excluding pseudogenes). The key from the previous lab may help (e.g. the `fastareader` function). Save the output in a tab-separated text file (`ecoli_startStop.tsv`). The file will look like:
```
start stop length
ATG TAG 939
ATG TAA 414
ATG TGA 414
ATG TAG 600
ATG TAA 693
ATG TAA 1320
ATG TAA 1536
ATG TAA 966
ATG TGA 1110
…
```
=> Save your work in `answers.py`.
=> Push the Python script and the TSV file to your GitHub repo.
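The steps of Task 1A can be sketched on a toy input as follows. This is one possible outline, not the expected solution. The names `read_fasta`, `start_stop_length`, and `write_tsv` are hypothetical (the `fastareader` function from the previous lab may have a different interface). The sketch assumes that pseudogenes carry a `[pseudo=true]` tag in the FASTA header and that gene length means the full CDS length including the stop codon; check both against the real file.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def start_stop_length(records):
    """Return (start, stop, length) tuples, skipping pseudogene records."""
    rows = []
    for header, seq in records:
        # Assumption: pseudogenes are tagged like this in the header
        if "[pseudo=true]" in header:
            continue
        rows.append((seq[:3], seq[-3:], len(seq)))
    return rows


def write_tsv(rows, path):
    """Write the rows to a tab-separated file with a header line."""
    with open(path, "w") as out:
        out.write("start\tstop\tlength\n")
        for start, stop, length in rows:
            out.write(f"{start}\t{stop}\t{length}\n")


# Toy records (made up for illustration, not real E. coli genes)
toy = """>toy_cds_1 [gene=a]
ATGAAATAG
>toy_cds_2 [gene=b] [pseudo=true]
ATGCCC
>toy_cds_3 [gene=c]
ATGAAA
CCCTGA
""".splitlines()

rows = start_stop_length(read_fasta(toy))
```

For the real data, pass an open file handle to `read_fasta` instead of the toy lines, then call `write_tsv(rows, "ecoli_startStop.tsv")`.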
Task 1B (`2 pts`)
=> Write Python code (in the same file, `answers.py`) to generate the codon usage profile and save it in a file named `ecoli_codons.tsv`. The `genetic_code` module may help you look up amino acids. The file will look like:
```
codon aminoacid count
AAA K 44704
AAC N 28589
AAG K 13563
AAT N 23075
ACA T 9128
ACC T 31223
ACG T 19133
…
```
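A minimal sketch of the counting step is shown below, using toy sequences. Because the interface of the course's `genetic_code` module is not shown here, the sketch builds its own lookup table from the standard genetic code; `CODON_TABLE`, `count_codons`, and `codon_rows` are hypothetical names. It assumes every sequence is read in frame from its first base, and it labels each codon by the table alone, so an alternative start codon such as GTG is listed as V.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons ordered by base in the order T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codons(sequences):
    """Count in-frame codons over all sequences, ignoring any trailing bases."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases such as N
                counts[codon] += 1
    return counts


def codon_rows(counts):
    """Return (codon, aminoacid, count) rows sorted by codon."""
    return [(c, CODON_TABLE[c], counts[c]) for c in sorted(CODON_TABLE)]


# Toy sequences (made up for illustration)
toy_counts = count_codons(["ATGAAAAAATAG", "ATGCGTTAA"])
toy_rows = codon_rows(toy_counts)
```

The rows can then be written to `ecoli_codons.tsv` with a `codon`, `aminoacid`, `count` header line, one tab-separated row per codon.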
Task 2
=> Use `R` and the `ggplot2` package to generate the following plots. You may either download and use a local RStudio installation or use [RStudio Cloud](https://rstudio.cloud). Remember to install the `ggplot2` and `dplyr` packages.
=> Save your script as `plots.R`, together with your plots. Push them all to GitHub.
Task 2A (`2 pts`)
Generate the plots given in `task2A.pdf` and save them in a single PDF file named `ecoli_startStop.pdf`.
The plots show:
1) Gene length distribution
2) Gene length distribution filled by start codons
3) Gene length distribution filled by start codons, with both x and y axes in log scale
4) Gene length distribution filled by stop codons
5) Gene length distribution filled by stop codons, with both x and y axes in log scale
6) Gene length distribution for each start (x) and stop (y) codon. Use `facet_grid`.
7) Gene length distribution for each start (x) and stop (y) codon, log-scaled. Use `facet_grid`.
Hints:
* `ggplot(genes) + geom_histogram(aes(…))` is used for every plot in this task.
* Use the following code block to save multiple plots in a single PDF:
```
library(ggplot2)

# Read the table produced in Task 1A
genes <- read.table('ecoli_startStop.tsv', sep='\t', header=TRUE)

# Every plot drawn between pdf() and dev.off() becomes one page of the PDF.
# If you run the script with source(), wrap each plot in print().
pdf('ecoli_startStop.pdf')
ggplot(genes) + geom_histogram(aes(…))  # Gene length distribution
ggplot(genes) + geom_histogram(aes(…))  # Gene length distribution filled by start codons
ggplot(genes) + geom_histogram(aes(…))  # Filled by start codons, both x and y axes in log scale
ggplot(genes) + geom_histogram(aes(…))  # Gene length distribution filled by stop codons
ggplot(genes) + geom_histogram(aes(…))  # Filled by stop codons, both x and y axes in log scale
ggplot(genes) + geom_histogram(aes(…))  # For each start (x) and stop (y) codon. Use facet_grid
ggplot(genes) + geom_histogram(aes(…))  # For each start (x) and stop (y) codon, log-scaled. Use facet_grid
dev.off()
```
Task 2B (`2 pts`)
Generate the codon usage profile plot as given in `task2B.pdf` and save it in a single PDF file named `ecoli_codon.pdf`.
Hints:
* Use the `filter` function to remove stop codons by requiring that `aminoacid` is not equal to `*`.
* Use `geom_bar(stat="identity", aes(…))`. Fill in `…` with the correct parameters, such as `x`, `y`, and `fill`.
* Use `facet_wrap` to generate a separate bar plot for each amino acid.
Task 3 (`3 pts`)
=> Repeat the analyses of Task 1 and Task 2 for _`Thermotoga maritima`_. Append the new code for _T. maritima_ to `answers.py` and `plots.R`. Name your output PDF files `tmaritima_startStop.pdf` and `tmaritima_codons.pdf`.
The download link is the following:
`ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/545/GCF_000008545.1_ASM854v1/GCF_000008545.1_ASM854v1_cds_from_genomic.fna.gz`
=> Push all the output files to your repo.
=> Compare and contrast the start and stop codon usage, as well as the codon usage profiles, of _E. coli_ and _T. maritima_. Pay special attention to the `Arginine (R)` codons. Write one paragraph on what you observe in this file, below.
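Raw counts are hard to compare between two genomes of different size, so one option is to convert each count into the fraction of its amino acid's total. The sketch below is only a suggestion: `relative_usage` is a hypothetical helper, and the sample rows are the K and N lines from the example `ecoli_codons.tsv` excerpt above, not a complete table.

```python
from collections import defaultdict


def relative_usage(rows):
    """Map each codon to its share of the total count for its amino acid.

    `rows` is an iterable of (codon, aminoacid, count) tuples.
    """
    rows = list(rows)
    totals = defaultdict(int)
    for _, aa, count in rows:
        totals[aa] += count
    # Amino acids with a zero total are left out to avoid dividing by zero
    return {
        codon: count / totals[aa]
        for codon, aa, count in rows
        if totals[aa] > 0
    }


# Rows copied from the example E. coli excerpt (K and N codons only)
sample = [
    ("AAA", "K", 44704),
    ("AAG", "K", 13563),
    ("AAC", "N", 28589),
    ("AAT", "N", 23075),
]
fractions = relative_usage(sample)
```

Here `fractions["AAA"]` is roughly 0.77, meaning about three quarters of the lysine codons in the excerpt are AAA. Applying the same function to the six arginine codons of each organism gives numbers that can be compared directly.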
Your paragraph on codon usage profiles of two organisms:
```
```