This dataset comes from an experiment on SARS-CoV-2 infection of human lung cells. This experiment used RNASeq to profile gene expression in cells infected with the virus and control cells that had a mock infection treatment.
The data for this assignment is located at the path /pvol/data/covid on the class RStudio server. Run the following commands to create a symbolic link to it in the data directory of your assignment:
cd data
ln -s /pvol/data/covid .
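Running `ln -s` a second time fails because the link already exists. As a minimal sketch (the function name `link_data` is hypothetical and not part of the assignment materials), the following helper creates the link only when it is missing and reports where an existing link points:

```shell
# Create a symbolic link to TARGET inside DEST_DIR unless one already exists.
# Usage: link_data TARGET DEST_DIR
# The function name and argument order are illustrative assumptions.
link_data() (
  target=$1
  dest_dir=$2
  name=$(basename "$target")      # e.g. "covid" for /pvol/data/covid
  mkdir -p "$dest_dir"
  if [ -L "$dest_dir/$name" ]; then
    # A link is already in place: show where it points and leave it alone
    echo "link already present: $(readlink "$dest_dir/$name")"
  else
    ln -s "$target" "$dest_dir/$name" && echo "linked $dest_dir/$name"
  fi
)
```

From the top level of your assignment this would be called as `link_data /pvol/data/covid data`.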
The following files are present within the covid data directory:
File Name | Purpose |
---|---|
SraRunTable-2.txt | A comma-separated values file obtained from the NCBI Sequence Read Archive (SRA). You can ignore most columns in this file. The important ones are Run and Group. |
*.fastq.gz | Raw reads, one file per sample. These are 75 bp single-end reads in compressed FASTQ format. The SRRXXXX codes in the file names should match those in the Run column of SraRunTable-2.txt. |
GCF_000001405.39_GRCh38.p13_rna.fna | The human transcriptome. Contains one sequence per transcript (mRNA sequence) in the human genome. |
GCF_009858895.2_ASM985889v3_genomic.fna | The genome of SARS-CoV-2, the virus that causes COVID-19. This is a single sequence that encodes several transcripts. The virus employs the human cell machinery to express proteins from these transcripts. |
combined.* | The combined sequences from the human transcriptome and the SARS-CoV-2 genome. The combined dataset has been indexed for use with bowtie2. |
transcript_to_gene_map.txt | Encodes the relationship between genes and their transcripts. |
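Since only the Run and Group columns of SraRunTable-2.txt matter, it can help to view them on their own. The sketch below looks the two columns up by name in the header row rather than by position. It assumes a plain comma-separated file with no quoted fields that contain commas, which may not hold for your copy of the file; the function name `run_groups` is hypothetical.

```shell
# Print the Run and Group columns of an SRA run table, tab-separated.
# Usage: run_groups PATH_TO_RUN_TABLE
# Assumption: fields never contain embedded commas (no quoted CSV fields).
run_groups() (
  awk -F',' '
    { sub(/\r$/, "") }            # tolerate Windows line endings
    NR == 1 {
      # Locate the two columns by their header names
      for (i = 1; i <= NF; i++) {
        if ($i == "Run")   run = i
        if ($i == "Group") grp = i
      }
      if (!run || !grp) {
        print "Run or Group column not found" > "/dev/stderr"
        exit 1
      }
      next
    }
    { print $run "\t" $grp }
  ' "$1"
)
```

For example, `run_groups data/covid/SraRunTable-2.txt` should list one line per sample, which you can compare against the names of the fastq files.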
Note that the combined.* files are provided for your convenience and to save disk space. They were created by running the following commands (provided for your information; you don't need to rerun them).
# Concatenate Human and SARS-CoV-2 Sequences
cat GCF_000001405.39_GRCh38.p13_rna.fna GCF_009858895.2_ASM985889v3_genomic.fna > combined.fna
# Generate a bowtie2 index
rsem-prepare-reference --transcript-to-gene-map transcript_to_gene_map.txt --bowtie2 -p 6 combined.fna combined
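If you want to convince yourself that combined.fna really is the concatenation of the two inputs, one simple check is to count FASTA header lines. This is a sketch only; the function names `count_fasta_records` and `check_combined` are hypothetical helpers rather than part of RSEM or bowtie2.

```shell
# Count the sequences in a FASTA file (one header line starting with ">" each).
count_fasta_records() {
  grep -c '^>' "$1"
}

# Check that the combined file holds exactly the human plus viral sequences.
# Usage: check_combined HUMAN_FNA VIRUS_FNA COMBINED_FNA
check_combined() (
  human=$(count_fasta_records "$1")
  virus=$(count_fasta_records "$2")
  combined=$(count_fasta_records "$3")
  echo "human=$human virus=$virus combined=$combined"
  # Succeeds (exit status 0) only when the counts add up
  [ "$combined" -eq $((human + virus)) ]
)
```

Run from inside data/covid, the call would be `check_combined GCF_000001405.39_GRCh38.p13_rna.fna GCF_009858895.2_ASM985889v3_genomic.fna combined.fna`. Because the viral genome is a single sequence, the virus count should be 1.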
This is a rich dataset, from which many analyses are possible. An obvious starting point would be to:
- Use rsem-calculate-expression to align reads and quantify gene expression for each sample against the combined database. This follows a very similar workflow to the one used in the RNASeq assignment, except that this time you are working with single-end reads. The following command illustrates this for a single file:
rsem-calculate-expression -p 6 --bowtie2 data/covid/SRR12937417_2M.fastq.gz data/covid/combined cache/SRR12937417_2M
- Use tximport to import the RSEM results into R.
- Use PCA to explore how the samples group.
- Use DESeq2 to find genes differentially expressed between infected and uninfected samples.

When interpreting your results, remember that one of the sequences in your analysis is actually the viral genome. This sequence has the ID NC_045512.2.
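The single-file command above needs to be repeated for every sample. One way to do that is a loop over the fastq files, sketched below. The function name `quantify_all` and its optional third argument are hypothetical; the sketch assumes output goes to a directory such as cache, as in the single-file example, and it skips samples that already have a `.genes.results` file so that an interrupted run can be resumed.

```shell
# Quantify every sample in DATA_DIR with RSEM, writing results to OUT_DIR.
# Usage: quantify_all DATA_DIR OUT_DIR [PREFIX]
# PREFIX is an illustrative hook: pass "echo" to print the commands
# instead of running them (a dry run).
quantify_all() (
  data_dir=$1
  out_dir=$2
  prefix=${3:-}
  mkdir -p "$out_dir"
  for fq in "$data_dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue                 # no fastq files found
    sample=$(basename "$fq" .fastq.gz)       # e.g. SRR12937417_2M
    # Skip samples that already have gene-level RSEM output
    if [ -e "$out_dir/$sample.genes.results" ]; then
      echo "skipping $sample (already quantified)" >&2
      continue
    fi
    # Same options as the single-file example: 6 threads, bowtie2 aligner
    $prefix rsem-calculate-expression -p 6 --bowtie2 \
      "$fq" "$data_dir/combined" "$out_dir/$sample"
  done
)
```

Try `quantify_all data/covid cache echo` first to review the commands, then drop the final `echo` to run them.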
How many genes overall are differentially expressed? If there are many such genes, you might consider performing an enrichment analysis to see whether certain Gene Ontology terms are over-represented in infected or mock samples. You might want to adopt an approach based on clusterProfiler, covered by Ulf Schmitz in his lecture on the topic.
An alternative method is described in the 2021 coding lecture on interpreting differential expression results (coding lecture 7 - 2021) and uses the R package topGO. You can use any of these approaches, or even a simple web-based approach such as GOrilla.