BC3203

This dataset comes from an experiment on SARS-CoV-2 infection of human lung cells. This experiment used RNASeq to profile gene expression in cells infected with the virus and control cells that had a mock infection treatment.

Data Setup

The data for this assignment are located at /pvol/data/covid on the class RStudio server. Run the following commands to create a symbolic link to that directory inside the data directory of your assignment:

cd data
ln -s /pvol/data/covid .

The following files are present within the covid data directory:

| File Name | Purpose |
| --- | --- |
| SraRunTable-2.txt | A comma-separated values file obtained from the NCBI Sequence Read Archive (SRA). You can ignore most columns in this file. The important ones are Run and Group. |
| *.fastq.gz | Raw reads, one file per sample. These are 75 bp single-end reads in compressed fastq format. The SRRXXXX codes in the file names should match those in the Run column of SraRunTable-2.txt. |
| GCF_000001405.39_GRCh38.p13_rna.fna | The human transcriptome. Contains one sequence per transcript (mRNA sequence) in the human genome. |
| GCF_009858895.2_ASM985889v3_genomic.fna | The genome of SARS-CoV-2, the virus that causes COVID-19. This is a single sequence that encodes several transcripts. The virus uses the human cell machinery to express proteins from these transcripts. |
| combined.* | The combined sequences of the human transcriptome and the SARS-CoV-2 genome. The combined dataset has been indexed for use with bowtie2. |
| transcript_to_gene_map.txt | Encodes the relationship between genes and their transcripts. |
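
Before starting, it can be worth confirming that every fastq file really does have an entry in the Run column. The following is a minimal sketch rather than a required step. The function names run_ids and unmatched_fastq are made up for this example. It assumes that SraRunTable-2.txt has a header row containing a column named Run, that no quoted field containing a comma comes before that column, and that the SRR code is the part of each file name before the first underscore or dot.

```shell
# Print the values in the "Run" column of a comma-separated run table.
# Assumption: the header row names the column exactly "Run".
run_ids() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
    col     { print $col }
  ' "$1"
}

# Print the name of each fastq file whose SRR code is absent from the table.
# $1 = directory containing *.fastq.gz, $2 = path to the run table
unmatched_fastq() {
  dir=$1
  table=$2
  for f in "$dir"/*.fastq.gz; do
    [ -e "$f" ] || continue          # directory has no fastq files
    name=$(basename "$f")
    id=${name%%_*}                   # SRR12937417_2M.fastq.gz -> SRR12937417
    id=${id%%.*}                     # also handles SRR12937417.fastq.gz
    run_ids "$table" | grep -qxF "$id" || echo "$name"
  done
}

# Example call from the assignment directory:
# unmatched_fastq data/covid data/covid/SraRunTable-2.txt
```

If the call prints nothing, every file name matched a Run value.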

Note that the combined.* files are provided for your convenience and to save disk space. They were created by running the following commands, which are shown for your information only. You don’t need to rerun them.

# Concatenate Human and SARS-CoV-2 Sequences
cat GCF_000001405.39_GRCh38.p13_rna.fna GCF_009858895.2_ASM985889v3_genomic.fna > combined.fna
# Generate a bowtie2 index
rsem-prepare-reference --transcript-to-gene-map transcript_to_gene_map.txt --bowtie2 -p 6 combined.fna combined

Suggested Analyses

This is a rich dataset from which many analyses are possible. An obvious starting point would be to:

  1. Run rsem-calculate-expression to align reads and quantify gene expression for each sample against the combined database. The workflow is very similar to the one used in the RNASeq assignment, except that this time you are working with single-end reads. The command for a single file looks like this:
    rsem-calculate-expression -p 6 --bowtie2 data/covid/SRR12937417_2M.fastq.gz data/covid/combined cache/SRR12937417_2M
    
  2. Import results to R using tximport
  3. Visualise relationships between samples using a PCA
  4. Use DESeq2 to find genes differentially expressed between infected and uninfected samples.
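
Step 1 shows the command for a single file. One possible way to extend it to every sample is sketched below. The function name rsem_commands is hypothetical, and the flags are copied unchanged from the single-file example. The function only prints the commands, so you can inspect them before running anything. It skips samples that already have a .genes.results file in the output directory, and it assumes that your paths contain no spaces.

```shell
# Print one rsem-calculate-expression command per fastq file.
# $1 = directory holding *.fastq.gz and the combined.* index files
# $2 = directory for RSEM output (cache in the single-file example)
rsem_commands() {
  indir=$1
  outdir=$2
  for fq in "$indir"/*.fastq.gz; do
    [ -e "$fq" ] || continue                     # no fastq files found
    sample=$(basename "$fq" .fastq.gz)
    # Skip samples that already have gene-level results
    [ -e "$outdir/$sample.genes.results" ] && continue
    echo "rsem-calculate-expression -p 6 --bowtie2 $fq $indir/combined $outdir/$sample"
  done
}

# Review the commands first:
# rsem_commands data/covid cache
# Then run them one after another:
# rsem_commands data/covid cache | sh
```

Printing the commands rather than running them directly makes it easy to check sample names and output paths, and the skip test lets you restart after an interruption without redoing finished samples.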

When interpreting your results, remember that one of the sequences in your analysis is actually the viral genome. This sequence has the ID NC_045512.2.
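
As a quick sanity check, you could look at how many reads were assigned to the viral sequence in each sample. Mock samples would be expected to show few or none. The sketch below uses a hypothetical function name, viral_abundance. It assumes that the RSEM output is in one directory, that each sample has a tab-separated .isoforms.results file with the standard transcript_id, expected_count and TPM columns, and that the viral sequence appears there under the transcript ID NC_045512.2.

```shell
# Print sample name, expected read count and TPM for the viral sequence,
# one tab-separated line per sample.
# $1 = directory containing RSEM *.isoforms.results files
viral_abundance() {
  for res in "$1"/*.isoforms.results; do
    [ -e "$res" ] || continue                    # no RSEM results found
    sample=$(basename "$res" .isoforms.results)
    awk -F'\t' -v s="$sample" '
      # Record the position of every column named in the header row
      NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
      $(col["transcript_id"]) == "NC_045512.2" {
        print s "\t" $(col["expected_count"]) "\t" $(col["TPM"])
      }
    ' "$res"
  done
}

# Example call from the assignment directory:
# viral_abundance cache
```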

How many genes overall are differentially expressed? If there are many such genes, you might consider performing an enrichment analysis to see whether certain Gene Ontology terms are over-represented among the genes that differ between infected and mock samples. You might want to adopt an approach based on clusterProfiler, covered by Ulf Schmitz in his lecture on the topic.

An alternative method, which uses the R package topGO, is described in the 2021 coding lecture on interpreting differential expression results (coding lecture 7 - 2021). You can use any of these approaches, or even a simple web-based tool such as GOrilla.