BC3203

This file forms the assessable component of your RNASeq assignment.

It asks you to prepare a report on the genomic locations of differentially expressed genes.

Make sure you complete the guide before you start this report. Example code in this report assumes that you have done this.

How to complete this report

Remove the instruction text (ie this text) and replace it with your work.
As you make changes commit them to git and push to github. There are just two files that you should commit to github for marking.
1. This file report.Rmd
2. The script slurm_rsem_calculate_expression_1M.sh you will write to run rsem on the 1M dataset. Don’t commit other files, especially large data files or outputs from running analyses. Ensure your final changes are pushed to github by the deadline.
Compile the final report to pdf format using the Knit function in RStudio. Make sure your final pdf is nicely formatted and is not missing any important content.
The compiled version of the document should contain figures and text. As a general rule it should not display code, but your code (in the .Rmd file) will form an important part of your assessment. Be sure to include code for all steps in your analysis (hidden in the .Rmd) but do not include anything more than this. Make sure your final report does not contain warnings or error messages. Use knitr options to make sure this undesirable content is hidden in the final doc. For example;

``{r, echo = FALSE, warning = FALSE, message = FALSE}
# Some R code here that might produce a plot.  The code shuold be hidden but the plot should be visible in the final document
``

What code should I include in my `report.Rmd`?

The code in your report should include work that you have completed after completing the guide. Don’t include code from the guide as this is just clutter and will obscure your valuable work. If you play around with experimental code do not include your experiments in the report. Only include code that is relevant to production of the final report and would be required to reproduce the results that you have obtained. As stated above, this code should be hidden and should not produce any extra text (warnings or informational output) that appears in your final pdf report.

Remember that you will be marked on code you write. Quality and not quantity is important. Be sure to format your code using a consistent and clear code style (for R code follow the tidyverse style guide). Do not include any irrelevant, incorrect or incomplete code in your report. Make sure you organise your code into chunks, each of which performs a distinct task. Use spacing effectively to improve readability. In a narrative document you would use spaces between paragraphs but not between every line, the same applies to code.

Re-analysis using the 1M reads dataset

The data for this assignment is available at several different read depths. So far you have only analyzed the shallowest dataset which has 10,000 reads per sample. In this part of the assignment you will need to redo your analysis based on the 1M reads data. The additional read depth in this data will increase the power to detect differentially expressed genes and is required in order to find the genomic region responsible for differences between NNN and NN mice.

1: Write R code which downloads the available fastq data for the 1M read dataset

Look at the code for downloading the 10k reads data in the guide ) and extend it to allow for the 1M reads data. Make sure you store the 1M reads data in a different folder to the 10k reads data. Do not include this code in your report.

2: Adapt code in slurm_rsem_calculate_expression.sh so that it runs alignment for the 1M read dataset. Write your answer in a new file called slurm_rsem_calculate_expression_1M.sh.

After you have got your code to work you should run it as shown below.

sbatch slurm_rsem_calculate_expression_1M.sh

Note that your original slurm_rsem_calculate_expression.sh script had a time limit of 60 minutes. You will need to increase this (eg to 600 minutes) in order for the 1M data to run successfully.

Genomic location of differentially expressed genes

In part 5 of the guide you discovered some differentially expressed genes. In this section you will redo this analysis using the 1M reads dataset. You should discover many more differentially expressed genes and your task will then be to map the genomic location of these genes in order to identify candidates for control of Nkt cell production.

The design of this particular study was intended to identify genes involved in controlling Nkt cell expression by contrasting expression in two specially generated strains of mice. The two mice strains are;

NOD.Nkrp1 (Coded as NN). These are NOD Mice which are deficient in the production of certain immune cells including Nkt cells.
NOD.Nkrp1.Nkt1 (coded as NNN). The mice are the same as the NOD.Nkrp1 except that a specific region of their genome has been replaced with a segment from C57BL/6 mice which are not immuno-deficient and have normal production of Nkt cells. This region was designed carefully so that it “rescues” production of Nkt cells compared with the NOD.Nkrp1 strain. We call this region the congenic segment

The hypothesis here is that genes within the congenic segment are responsible for control of the production of Nkt cells. We expect that the genes within the congenic segment will be more likely to be differentially expressed than genes from other regions of the genome. We also might expect that genes that differ strongly in expression between the two mice strains within this congenic segment are also likely to be good candidates for Nkt control genes.

Your task in this section is to report on, and interpret the differential expression results. You should use the 1M_reads data for this section as it provides the more accurate quantification estimates of gene expression.

Your answer should consist of no more than 1 page of text and no more than three figures. Your answer should describe the results displayed in the figures and should explain why these results provide evidence for the location of the congenic segment. Pay special attention to your figure(s) and their captions. See the lecture on Principles of scientific graphics for advice on producing high quality figures and captions. Your figures should convey the most likely genomic location of the congenic segment, using the location of differentially expressed genes as well as other information such as the effect size and p-value for those genes. It is very important that you have a good understanding of the concept of “genomic location” and genomic coordinates. This is covered is lectures in week 9.

Help on completing this section

I have provided a function in the file R/load_data.R that reads the mouse gene models into R. Use this function to load the mouse gene models again as follows;

source("R/load_data.R")
mouse_genes <- load_gene_models()

In order to visualise the genome coordinates of differentially expressed genes you will need to match their gene identifiers with those of the gene models. The gene identifiers look like this

ENSMUSG00000102693

Take a look at the first few rows of the mouse_genes data. You should see that these gene identifiers are buried within the attribute string. The attribute string is long and contains many pieces of information concatenated together. The next step extracts the gene identifiers from the attribute string and places them into their own column in the dataset.

mouse_genes <- mouse_genes %>% 
  extract(attribute, into = "gene_id",regex = "(ENSMUS[^\"]+)")

BC3203

How to complete this report

What code should I include in my report.Rmd?

Re-analysis using the 1M reads dataset

Genomic location of differentially expressed genes

What code should I include in my `report.Rmd`?