This dataset comes from an experiment on heat stress and bleaching by Cleves et al. The organism used for this experiment is an anemone called Exaiptasia diaphana which is used as a model for coral bleaching. It is a useful model because the anemone is a facultative symbiont, meaning that it can happily exist with or without symbiotic dinoflagellate algae, allowing controlled experiments such as this one in which anemones are subjected to heat stress with and without symbionts. Such experiments are impossible in stony corals because those are obligate symbionts (they cannot survive without their algae for extended periods). The down-side of using a model such as this however is that anemones are actually only distantly related to corals so the degree to which the results apply in corals is not clear.
All of the data for this assignment were downloaded from the public domain. They are available through the NCBI under bioproject PRJNA662400. Feel free to browse the bioproject if you want to see more details about the sequencing, but don’t attempt to download that data directly as it is large. I have provided down-sampled data that is suitable for analysis in this assignment using our rstudio server (see below).
First create a directory data for storing the original data for this project and another directory called cache for storing your own analysis outputs
mkdir data
mkdir cache
The data for this assignment is located at the path /pvol/data/anemone on the class rstudio server. Run the following commands to create a symbolic link to this in the data directory of your assignment
cd data
ln -s /pvol/data/anemone .
The following files are present within the anemone data directory;
| File Name | Purpose |
|---|---|
SraRunTable.csv |
A comma separated values file obtained from the NCBI Short Read Archive. You can ignore most columns in this file. The important ones are Run, strain and treatment. |
*.fastq.gz |
Raw reads. One file is provided per sample. These correspond to 50bp single-end reads in compressed fastq format. The codes SRRXXXX should match those in the Run column of SraRunTable.csv |
cds_from_genomic.fna |
The Exaiptasia diaphana transcriptome. Contains one sequence per transcript (mRNA sequence) |
gene_transcript_map.tsv |
This file encodes the relationship between genes and their transcripts. |
annotation_table.tsv |
Functional annotation of anemone genes including GO terms. These annotations are based on homology to genes in highly studied organisms such as Humans, Mice and Fruit Flies. The swissprot_blastp column gives the Uniprot accession to the closest homolog of the anemone gene (in gene) in the Swissprot database. Columns go_name and go_term give Gene Ontology information obtained from Swissprot for this protein. |
Regardless of how you decide to analyse this data you will want to align reads and quantify gene expression for each sample. This is done using RSEM which is the same tool we used for the RNASeq assignment. In this case however;
First you will need to build your own reference index using the transcript sequences (cds_from_genomic.fna) and the gene/transcript mapping file (gene_transcript_map.tsv) as follows;
rsem-prepare-reference --bowtie2 --transcript-to-gene-map data/anemone/gene_transcript_map.tsv data/anemone/cds_from_genomic.fna cache/aiptasia_ref
Once the reference is built you can perform alignment and quantification. This time you would be working with single-end reads. This illustrates the command for a single file
rsem-calculate-expression -p 6 --bowtie2 data/anemone/SRR12613270_4M.fastq.gz cache/aiptasia_ref cache/SRR12613270_4M
Since there are 36 files you should write a script to run this over all files. You should write your script as a slurm script and submit it using sbatch. Include this script in your git repository for the assignment
When you are finished you should have 36 files with the extension *.genes.results (one for each sample).
Statistical analyses for this data are much more complex than for the RNASeq assignment because in this case we have two factors, strain (aposymbiotic, APO and symbiotic, SSB01) and time (0,3,12,24,48,96). To make things even more complicated the time variable includes other changes that happen throughout the experiment. In particular, the first heat-stress timepoint is at 3h and the symbiotic anemones start to bleach heavily at around 24h (see Figures 1A-C in the original paper).
Your first task will be to clean up the metadata file, SraRunTable.csv. Read it in, select relevant columns and separate variables where they are encoded within the same column. I recommend aiming for something like this;
## Run strain time rep
## 1 SRR12613270 APO 24h 1
## 2 SRR12613271 APO 12h 3
## 3 SRR12613272 APO 12h 2
## 4 SRR12613273 APO 12h 1
## 5 SRR12613274 APO 3h 3
## 6 SRR12613275 APO 3h 2
Next, find and label your input files. You’ll need to construct a named vector where the names are the run labels (eg SRR12613298) and the values are the relative paths to corresponding *.genes.results files.
Next use tximport to read gene expression data from all those results files.
Finally, construct a DESeqDataSet object using the DESeqDataSetFromTximport function. In my code this looked like this;
dds_anemone <- DESeqDataSetFromTximport(txi_genes,run_table, ~ strain * time)
Here txi_genes is the result from tximport, run_table is the cleaned metadata, and ~strain*time is our statistical model. This final part is important. The statistical model I have chosen here is a full model which includes fixed effects for strain and time as well as an interaction term between them. Because there are so many levels of the time variable this results in quite a complex model. A complete analysis of this model is both technically challenging and difficult to interpret. I suggest some simple approaches to tackling this below.
This complex experimental design presents a challenge to interpretation. The best approach in such situations is to start with a thorough exploratory data analysis, giving careful thought to the experimental design.
Some recommended analyses include;
PCA. This analysis alone is quite interesting for the current experiment. Think carefully about the experimental design with this and look at how the gene expression differences between symbiotic and aposymbiotic anemones changes over time.DESeq2 to find genes that show a very strong change between 0h and 3h. These genes represent those that are changing in response to the initial heat stress condition.
dds_anemone_de <- DESeq(dds_anemone)
results_heat <- results(dds_anemone_de,name = "time_3h_vs_0h")
padj) and potentially also on effect size (log2FoldChange)annotation_table.tsv which gives GO terms for the anemone genes.