BC3203

This dataset comes from an experiment on heat stress and bleaching by Cleves et al. The organism used for this experiment is an anemone called Exaiptasia diaphana which is used as a model for coral bleaching. It is a useful model because the anemone is a facultative symbiont, meaning that it can happily exist with or without symbiotic dinoflagellate algae, allowing controlled experiments such as this one in which anemones are subjected to heat stress with and without symbionts. Such experiments are impossible in stony corals because those are obligate symbionts (they cannot survive without their algae for extended periods). The down-side of using a model such as this however is that anemones are actually only distantly related to corals so the degree to which the results apply in corals is not clear.

All of the data for this assignment were downloaded from the public domain. They are available through the NCBI under bioproject PRJNA662400. Feel free to browse the bioproject if you want to see more details about the sequencing, but don’t attempt to download that data directly as it is large. I have provided down-sampled data that is suitable for analysis in this assignment using our rstudio server (see below).

Data Setup

First create a directory data for storing the original data for this project and another directory called cache for storing your own analysis outputs

mkdir data
mkdir cache

The data for this assignment is located at the path /pvol/data/anemone on the class rstudio server. Run the following commands to create a symbolic link to this in the data directory of your assignment

cd data
ln -s /pvol/data/anemone .

The following files are present within the anemone data directory;

File Name	Purpose
`SraRunTable.csv`	A comma separated values file obtained from the NCBI Short Read Archive. You can ignore most columns in this file. The important ones are `Run`, `strain` and `treatment`.
`*.fastq.gz`	Raw reads. One file is provided per sample. These correspond to 50bp single-end reads in compressed fastq format. The codes `SRRXXXX` should match those in the `Run` column of `SraRunTable.csv`
`cds_from_genomic.fna`	The Exaiptasia diaphana transcriptome. Contains one sequence per transcript (mRNA sequence)
`gene_transcript_map.tsv`	This file encodes the relationship between genes and their transcripts.
`annotation_table.tsv`	Functional annotation of anemone genes including GO terms. These annotations are based on homology to genes in highly studied organisms such as Humans, Mice and Fruit Flies. The `swissprot_blastp` column gives the Uniprot accession to the closest homolog of the anemone gene (in `gene`) in the Swissprot database. Columns `go_name` and `go_term` give Gene Ontology information obtained from Swissprot for this protein.

Required Steps

Regardless of how you decide to analyse this data you will want to align reads and quantify gene expression for each sample. This is done using RSEM which is the same tool we used for the RNASeq assignment. In this case however;

We will align reads directly to a transcriptome, not a genome
We will align in single-end mode because these are single-end Illumina reads only

First you will need to build your own reference index using the transcript sequences (cds_from_genomic.fna) and the gene/transcript mapping file (gene_transcript_map.tsv) as follows;

rsem-prepare-reference --bowtie2 --transcript-to-gene-map data/anemone/gene_transcript_map.tsv data/anemone/cds_from_genomic.fna cache/aiptasia_ref

Once the reference is built you can perform alignment and quantification. This time you would be working with single-end reads. This illustrates the command for a single file

rsem-calculate-expression -p 6 --bowtie2 data/anemone/SRR12613270_4M.fastq.gz cache/aiptasia_ref cache/SRR12613270_4M

Since there are 36 files you should write a script to run this over all files. You should write your script as a slurm script and submit it using sbatch. Include this script in your git repository for the assignment

When you are finished you should have 36 files with the extension *.genes.results (one for each sample).

Suggested Data Import and Statistical Framework

Statistical analyses for this data are much more complex than for the RNASeq assignment because in this case we have two factors, strain (aposymbiotic, APO and symbiotic, SSB01) and time (0,3,12,24,48,96). To make things even more complicated the time variable includes other changes that happen throughout the experiment. In particular, the first heat-stress timepoint is at 3h and the symbiotic anemones start to bleach heavily at around 24h (see Figures 1A-C in the original paper).

Your first task will be to clean up the metadata file, SraRunTable.csv. Read it in, select relevant columns and separate variables where they are encoded within the same column. I recommend aiming for something like this;

##           Run strain time rep
## 1 SRR12613270    APO  24h   1
## 2 SRR12613271    APO  12h   3
## 3 SRR12613272    APO  12h   2
## 4 SRR12613273    APO  12h   1
## 5 SRR12613274    APO   3h   3
## 6 SRR12613275    APO   3h   2

Next, find and label your input files. You’ll need to construct a named vector where the names are the run labels (eg SRR12613298) and the values are the relative paths to corresponding *.genes.results files.

Next use tximport to read gene expression data from all those results files.

Finally, construct a DESeqDataSet object using the DESeqDataSetFromTximport function. In my code this looked like this;

dds_anemone <- DESeqDataSetFromTximport(txi_genes,run_table, ~ strain * time)

Here txi_genes is the result from tximport, run_table is the cleaned metadata, and ~strain*time is our statistical model. This final part is important. The statistical model I have chosen here is a full model which includes fixed effects for strain and time as well as an interaction term between them. Because there are so many levels of the time variable this results in quite a complex model. A complete analysis of this model is both technically challenging and difficult to interpret. I suggest some simple approaches to tackling this below.

Suggested Analyses

This complex experimental design presents a challenge to interpretation. The best approach in such situations is to start with a thorough exploratory data analysis, giving careful thought to the experimental design.

Some recommended analyses include;

Visualise relationships between samples using a PCA. This analysis alone is quite interesting for the current experiment. Think carefully about the experimental design with this and look at how the gene expression differences between symbiotic and aposymbiotic anemones changes over time.
Use DESeq2 to find genes that show a very strong change between 0h and 3h. These genes represent those that are changing in response to the initial heat stress condition.
```
  dds_anemone_de <- DESeq(dds_anemone)
  results_heat <- results(dds_anemone_de,name = "time_3h_vs_0h")
```
Filter this gene set based on adjusted p-value (statistical significance, padj) and potentially also on effect size (log2FoldChange)
If there are many differentially expressed genes you might consider performing an enrichment analysis to see if certain Gene Ontology terms are over-represented in genes responding to heat stress. You might want to adopt an approach based on ClusterProfiler, covered by Ulf Schmitz in his lecture on the topic. In order to do this you will need to make use of the data in annotation_table.tsv which gives GO terms for the anemone genes.