BC3203

This dataset comes from the following paper by JCU researchers:

Gene expression differences between abalone that are susceptible and resilient to a simulated heat wave event

The full paper can be found at this link: Shiel et al 2020

You should choose this dataset if you want to develop your skills in using DESeq2 for statistical analysis of RNASeq data after the count table has been produced. We have not explored this extensively in class, so if you choose this project you may need to read more about DESeq2 and/or ask your instructor for help.

This is a complex experiment and the raw data is very large. You are provided with partially processed data as follows:

File Name	Purpose
`count_table.tsv`	A matrix of counts. Transcripts are in rows. Samples are in columns. The first column provides the transcript names.
`metadata.tsv`	Metadata for all the samples (see below).

To download and unpack these files use the following commands. You should run these commands from the top level of your RStudio project directory for the independent project assignment.

wget 'http://data.qld.edu.au/public/Q5999/JCUBioinformatics/independent-project/data_abalone.tgz' -O data.tgz
tar -zxvf data.tgz

The sample metadata has many columns, reflecting the complex nature of this experiment. The columns are:

Column Name	Meaning
Location	One of three location codes, E (Elliston), F (Farm Beach), S (Port Lincoln)
Condition	Coded as U (Unsusceptible to heat stress) and S (Susceptible to heat stress)
Tank	Numerical ID representing the tank that the abalone were kept in
Abalone	Numerical ID representing the individual abalone
Time	One of four time points (see below)

A map of locations is provided in Figure 1 of Shiel et al.
The U and S codes were assigned to abalone depending on whether they survived (U) or died (S) during the post heat-stress time period. The idea is that abalone more affected by heat stress would be likely to die, whereas those less affected would survive.
In this experiment time is also related to temperature. Time points should be interpreted as follows:

Time Number	Meaning
1	February prior to transfer to tanks
2	18 Degrees Celsius
3	20 Degrees Celsius
4	21 Degrees Celsius
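
Because the time codes stand for conditions rather than quantities, it can help to attach readable labels to them before plotting. The following is a minimal sketch using a made-up data frame in place of `metadata.tsv`; the `TimeLabel` column name and the short label text are assumptions chosen for illustration.

```r
# Toy stand-in for the Time column of metadata.tsv (values are made up)
metadata <- data.frame(
  Sample = paste0("s", 1:8),
  Time = c(1, 2, 3, 4, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Labels follow the time-point table above; TimeLabel is a hypothetical column name
metadata$TimeLabel <- factor(
  metadata$Time,
  levels = 1:4,
  labels = c("Pre-transfer (Feb)", "18C", "20C", "21C")
)

# Count samples per labelled time point
table(metadata$TimeLabel)
```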

Suggested Analyses

A proper statistical analysis of this data is very complex and is not actually possible with DESeq2. This is because the same abalone were measured repeatedly, which necessitates the use of a random effect term in the linear model. Another R package called limma is required to deal with this.

Instead of attempting a full analysis you should focus on the following:

Use PCA to explore the data and determine experimental factors that have a strong effect on gene expression.
Create a subset of the data (select only certain samples) that would allow you to perform a simple statistical analysis (e.g. select only one time point, ignore some factors in the experiment).
Report numbers of differentially expressed genes from your analysis. Don’t attempt to interpret gene functions (no info is provided on this anyway).
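
As an illustration of the subsetting step, here is a minimal sketch that keeps only the samples from one time point. It uses small made-up objects in place of the real count table and metadata, and the choice of time point 4 is arbitrary; the `_sub` object names are assumptions for the example.

```r
# Toy stand-ins for metadata.tsv and count_table.tsv (all values are made up)
metadata <- data.frame(
  Sample = paste0("s", 1:8),
  Location = rep(c("E", "F"), each = 4),
  Time = rep(1:4, times = 2),
  stringsAsFactors = FALSE
)
set.seed(1)
counts <- matrix(
  rpois(5 * 8, lambda = 20), nrow = 5,
  dimnames = list(paste0("tx", 1:5), metadata$Sample)
)

# Keep only the samples from a single time point (4 is an arbitrary choice)
metadata_sub <- metadata[metadata$Time == 4, ]
counts_sub <- counts[, metadata_sub$Sample, drop = FALSE]

# Sanity check: count columns must match metadata rows, in the same order
stopifnot(identical(colnames(counts_sub), metadata_sub$Sample))
dim(counts_sub)
```

Subsetting the metadata first and then using its `Sample` column to select count columns keeps the two objects aligned, which is what DESeq2 expects.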

The following code should help you to get started with PCA:

Getting Started

library(DESeq2)
library(tidyverse) # Provides read_tsv(), %>%, rownames_to_column() and left_join()

counts <- read_tsv("raw_data/Abalone/count_table.tsv")
metadata <- read_tsv("raw_data/Abalone/metadata.tsv")

# Ensure that the column order in counts is the same as row order in metadata
# (this also drops the first column, which holds the transcript names)
counts <- counts[,metadata$Sample]

# Imports data to DESeq2 without setting a design matrix
dds <- DESeqDataSetFromMatrix(counts,colData = metadata, design = ~ 1)

# Performs a variance stabilising transform to make data suitable for plotting with a PCA
vst <- varianceStabilizingTransformation(dds)

# Perform a PCA on the vst data (transcripts in rows, samples in columns)
pcdata <- prcomp(assay(vst))

# Extract the rotation matrix from the PCA (one row per sample) and join it with metadata
pcdata_meta <- pcdata$rotation %>% as.data.frame() %>% 
  rownames_to_column("Sample") %>% 
  left_join(metadata, by = "Sample")
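
When deciding which components are worth plotting, it can be useful to check how much variance each one captures. This is a minimal sketch on a made-up matrix standing in for `assay(vst)`; the same two lines should apply to the `pcdata` object above, since `prcomp` always returns an `sdev` element.

```r
# Toy matrix standing in for assay(vst): 200 "transcripts" by 6 "samples"
set.seed(42)
toy <- matrix(
  rnorm(200 * 6), nrow = 200,
  dimnames = list(NULL, paste0("s", 1:6))
)

pc <- prcomp(toy)

# Proportion of the total variance captured by each principal component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(100 * var_explained, 1)
```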

When visualising the data or performing statistical tests, take care because the metadata uses numerical values to encode categorical variables (e.g. Tank). You should convert these to factors or they will be treated as continuous variables.

ggplot(pcdata_meta,aes(x=PC1,y=PC2)) + 
  geom_point(aes(color=as.factor(Tank)))

Figure: PCA plot of PC1 against PC2, with points coloured by Tank.
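
To see why the conversion to a factor matters for statistical models as well as plots, compare the design matrices R builds in each case. This is a small sketch with made-up tank IDs; `model.matrix` is a base R function and the variable names are chosen only for illustration.

```r
# Toy tank IDs (made up); in the real metadata these come from the Tank column
tank <- c(1, 1, 2, 2, 3, 3)

# Treated as a number: the model gets a single slope term for tank
X_numeric <- model.matrix(~ tank)

# Treated as a factor: the model gets a separate term for each tank beyond the first
X_factor <- model.matrix(~ factor(tank))

ncol(X_numeric) # intercept + one slope
ncol(X_factor)  # intercept + (number of tanks - 1)
```

With the numeric coding, the model would assume that tank 3 differs from tank 1 by exactly twice as much as tank 2 does, which makes no sense for an arbitrary ID.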

In the RNASeq tutorial we saw how to use DESeq2 to analyse a very simple experimental design with one factor. We can apply this to the abalone data if we pick a factor of interest, for example Location:

dds <- DESeqDataSetFromMatrix(counts,colData = metadata, design = ~ Location)
dds_fitted <- DESeq(dds)
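
Once a model has been fitted, the number of differentially expressed genes can be read from the results table. The sketch below uses simulated data from DESeq2's `makeExampleDESeqDataSet` as a stand-in for the abalone data, so it runs on its own; the 0.05 threshold is an assumption and `n_de` is a name chosen for the example.

```r
library(DESeq2)

# Simulated stand-in for the abalone data: 1000 genes, 12 samples,
# with a two-level factor called `condition` (not the real experiment)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds_fitted <- DESeq(dds)

# Results table for the last variable in the design (here `condition`)
res <- results(dds_fitted, alpha = 0.05)
summary(res)

# Number of genes with an adjusted p-value below 0.05
# (padj can be NA for genes removed by independent filtering)
n_de <- sum(res$padj < 0.05, na.rm = TRUE)
n_de
```

Note that for a factor with more than two levels, such as Location, `results()` returns a single pairwise comparison by default. Other comparisons can be requested with the `contrast` argument, for example `contrast = c("Location", "F", "E")`.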

To keep things simple, you should probably try to trim down the data so that you can sensibly analyse it with one factor. If needed, you can create more complex statistical models with multiple factors, as described here:

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs
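
For orientation, a two-factor design might look something like the sketch below. It again uses simulated data rather than the abalone data, and the `batch` factor is made up purely for illustration; with the real data you would substitute two metadata columns such as Location and Condition.

```r
library(DESeq2)

# Simulated stand-in data with a built-in two-level factor called `condition`
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# Add a second, made-up factor that is balanced across `condition`
dds$batch <- factor(rep(c("x", "y"), times = 6))

# Adjust for batch; the variable of main interest goes last in the formula
design(dds) <- ~ batch + condition
dds_fitted <- DESeq(dds)

# Lists the coefficients that were estimated
resultsNames(dds_fitted)
```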