This file forms the assessable component of your RNASeq assignment.
It asks you to prepare a report on the genomic locations of differentially expressed genes.
Make sure you complete the guide before you start this report. Example code in this report assumes that you have done this.
report.Rmd
slurm_rsem_calculate_expression_1M.sh
you will write to run rsem on the 1M dataset.
Don’t commit other files, especially large data files or outputs from running analyses. Ensure your final changes are pushed to github by the deadline.pdf
format using the Knit
function in RStudio. Make sure your final pdf
is nicely formatted and is not missing any important content.knitr
options to make sure this undesirable content is hidden in the final doc. For example;``{r, echo = FALSE, warning = FALSE, message = FALSE}
# Some R code here that might produce a plot. The code shuold be hidden but the plot should be visible in the final document
``
report.Rmd
?The code in your report should include work that you have completed after completing the guide. Don’t include code from the guide as this is just clutter and will obscure your valuable work. If you play around with experimental code do not include your experiments in the report. Only include code that is relevant to production of the final report and would be required to reproduce the results that you have obtained. As stated above, this code should be hidden and should not produce any extra text (warnings or informational output) that appears in your final pdf
report.
Remember that you will be marked on code you write. Quality and not quantity is important. Be sure to format your code using a consistent and clear code style (for R code follow the tidyverse style guide). Do not include any irrelevant, incorrect or incomplete code in your report. Make sure you organise your code into chunks, each of which performs a distinct task. Use spacing effectively to improve readability. In a narrative document you would use spaces between paragraphs but not between every line, the same applies to code.
The data for this assignment is available at several different read depths. So far you have only analyzed the shallowest dataset which has 10,000 reads per sample. In this part of the assignment you will need to redo your analysis based on the 1M reads data. The additional read depth in this data will increase the power to detect differentially expressed genes and is required in order to find the genomic region responsible for differences between NNN and NN mice.
1: Write R code which downloads the available fastq data for the 1M read dataset
Look at the code for downloading the 10k reads data in the guide ) and extend it to allow for the 1M reads data. Make sure you store the 1M reads data in a different folder to the 10k reads data. Do not include this code in your report.
2: Adapt code in slurm_rsem_calculate_expression.sh
so that it runs alignment for the 1M
read dataset. Write your answer in a new file called slurm_rsem_calculate_expression_1M.sh
.
After you have got your code to work you should run it as shown below.
sbatch slurm_rsem_calculate_expression_1M.sh
Note that your original slurm_rsem_calculate_expression.sh
script had a time limit of 60 minutes. You will need to increase this (eg to 600 minutes) in order for the 1M data to run successfully.
In part 5 of the guide you discovered some differentially expressed genes. In this section you will redo this analysis using the 1M reads dataset. You should discover many more differentially expressed genes and your task will then be to map the genomic location of these genes in order to identify candidates for control of Nkt cell production.
The design of this particular study was intended to identify genes involved in controlling Nkt cell expression by contrasting expression in two specially generated strains of mice. The two mice strains are;
NOD.Nkrp1
(Coded as NN
). These are NOD Mice which are deficient in the production of certain immune cells including Nkt cells.NOD.Nkrp1.Nkt1
(coded as NNN
). The mice are the same as the NOD.Nkrp1
except that a specific region of their genome has been replaced with a segment from C57BL/6 mice which are not immuno-deficient and have normal production of Nkt cells. This region was designed carefully so that it “rescues” production of Nkt cells compared with the NOD.Nkrp1
strain. We call this region the congenic segmentThe hypothesis here is that genes within the congenic segment are responsible for control of the production of Nkt cells. We expect that the genes within the congenic segment will be more likely to be differentially expressed than genes from other regions of the genome. We also might expect that genes that differ strongly in expression between the two mice strains within this congenic segment are also likely to be good candidates for Nkt control genes.
Your task in this section is to report on, and interpret the differential expression results. You should use the 1M_reads
data for this section as it provides the more accurate quantification estimates of gene expression.
Your answer should consist of no more than 1 page of text and no more than three figures. Your answer should describe the results displayed in the figures and should explain why these results provide evidence for the location of the congenic segment. Pay special attention to your figure(s) and their captions. See the lecture on Principles of scientific graphics for advice on producing high quality figures and captions. Your figures should convey the most likely genomic location of the congenic segment, using the location of differentially expressed genes as well as other information such as the effect size and p-value for those genes. It is very important that you have a good understanding of the concept of “genomic location” and genomic coordinates. This is covered is lectures in week 9.
Help on completing this section
I have provided a function in the file R/load_data.R
that reads the mouse gene models into R. Use this function to load the mouse gene models again as follows;
source("R/load_data.R")
mouse_genes <- load_gene_models()
In order to visualise the genome coordinates of differentially expressed genes you will need to match their gene identifiers with those of the gene models. The gene identifiers look like this
ENSMUSG00000102693
Take a look at the first few rows of the mouse_genes
data. You should see that these gene identifiers are buried within the attribute
string. The attribute
string is long and contains many pieces of information concatenated together. The next step extracts the gene identifiers from the attribute
string and places them into their own column in the dataset.
mouse_genes <- mouse_genes %>%
extract(attribute, into = "gene_id",regex = "(ENSMUS[^\"]+)")