BC3203

This dataset comes from the following paper by JCU researchers:

Gene expression differences between abalone that are susceptible and resilient to a simulated heat wave event

The full paper can be found at this link: Shiel et al 2020

You should choose this dataset if you want to develop your skills working with DESeq2 to perform statistical analysis of RNASeq data after the count table has been produced. As we have not explored this extensively in class, be aware that if you choose this project you may need to read more about DESeq2 and/or ask your instructor for help.

This is a complex experiment and the raw data is very large. You are provided with partially processed data as follows:

File Name        Purpose
count_table.tsv  A matrix of counts. Transcripts are in rows and samples are in columns. The first column provides the transcript names.
metadata.tsv     Metadata for all the samples (see below).

To download and unpack these files use the following commands. You should run these commands from the top level of your RStudio project directory for the independent project assignment.

wget 'http://data.qld.edu.au/public/Q5999/JCUBioinformatics/independent-project/data_abalone.tgz' -O data.tgz
tar -zxvf data.tgz

The sample metadata has many columns, reflecting the complex nature of this experiment. The columns are:

Column Name  Meaning
Location     One of three location codes: E (Elliston), F (Farm Beach), S (Port Lincoln)
Condition    Coded as U (Unsusceptible to heat stress) and S (Susceptible to heat stress)
Tank         Numerical ID representing the tank that the abalone were kept in
Abalone      Numerical ID representing the individual abalone
Time         One of four time points (see below)

Time Number  Meaning
1            February prior to transfer to tanks
2            18 Degrees Celsius
3            20 Degrees Celsius
4            21 Degrees Celsius

Suggested Analyses

A proper statistical analysis of this data is very complex and is not actually possible with DESeq2. This is because the same abalone were measured repeatedly, which necessitates a random effect term in the linear model. Another R package, limma, is required to deal with this.

Instead of attempting a full analysis you should focus on the following;

The following code should help you get started with a PCA.

Getting Started

library(DESeq2)
library(tidyverse)  # provides read_tsv(), %>%, rownames_to_column(), left_join() and ggplot()
counts <- read_tsv("raw_data/Abalone/count_table.tsv")
metadata <- read_tsv("raw_data/Abalone/metadata.tsv")

# Ensure that the column order in counts is the same as row order in metadata
counts <- counts[,metadata$Sample]

# Imports data to DESeq without setting a design matrix
dds <- DESeqDataSetFromMatrix(counts,colData = metadata, design = ~ 1)

# Performs a variance stabilising transform to make data suitable for plotting with a PCA
vst <- varianceStabilizingTransformation(dds)

# Perform a PCA on the vst data
pcdata <- prcomp(assay(vst))

# Extract the rotation matrix from the PCA and join it with metadata
pcdata_meta <- pcdata$rotation %>% as.data.frame() %>% 
  rownames_to_column("Sample") %>% 
  left_join(metadata, by = "Sample")
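
When plotting principal components it often helps to know how much of the total variance each one captures. The sketch below is an illustration only: it runs prcomp() on a small simulated matrix (toy_mat, a hypothetical stand-in for assay(vst)) and derives the percentages from the sdev element that prcomp() returns.

```r
# Simulated stand-in for assay(vst): 100 "transcripts" (rows) x 6 "samples" (columns).
# toy_mat and its sample names are hypothetical, not part of the abalone data.
set.seed(1)
toy_mat <- matrix(rnorm(600), nrow = 100,
                  dimnames = list(NULL, paste0("sample", 1:6)))

toy_pca <- prcomp(toy_mat)

# Each component's variance is sdev^2; express it as a percentage of the total
pct_var <- 100 * toy_pca$sdev^2 / sum(toy_pca$sdev^2)
names(pct_var) <- colnames(toy_pca$rotation)

round(pct_var, 1)
```

Applied to pcdata in the same way, these percentages could be added to the axis labels of a PCA plot, for example with labs().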

When visualising the data or performing statistical tests, take care because the metadata uses numerical values to encode categorical variables (e.g. Tank). You should convert these to factors or they will be treated as continuous variables.

ggplot(pcdata_meta,aes(x=PC1,y=PC2)) + 
  geom_point(aes(color=as.factor(Tank)))
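
The same conversion can also be done once, up front, so that every later plot or model sees categorical variables. This is a minimal base R sketch on a toy data frame: toy_meta is hypothetical and only mimics the column names and codes described in the metadata tables, and the short Time labels are abbreviations chosen here for illustration.

```r
# Hypothetical metadata using the same coding scheme as metadata.tsv
toy_meta <- data.frame(
  Sample    = paste0("s", 1:4),
  Location  = c("E", "F", "S", "E"),
  Condition = c("U", "S", "U", "S"),
  Tank      = c(1, 2, 1, 2),
  Time      = c(1, 2, 3, 4)
)

# Numeric IDs become factors so they are not treated as continuous
toy_meta$Tank <- factor(toy_meta$Tank)

# Time gets readable labels based on the time point table
toy_meta$Time <- factor(toy_meta$Time, levels = 1:4,
                        labels = c("Pre-transfer", "18C", "20C", "21C"))

str(toy_meta)
```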


In the RNASeq tutorial we saw how to use DESeq2 to analyse a very simple experimental design with one factor. We can apply this to the abalone data if we pick a factor of interest, for example Location:

dds <- DESeqDataSetFromMatrix(counts,colData = metadata, design = ~ Location)
dds_fitted <- DESeq(dds)

To keep things simple, you should probably try to trim down the data so that you can sensibly analyse it with one factor. If needed, you can create more complex statistical models with multiple factors as described here:

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs
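
One way to trim the data down to a single factor is to keep only the samples from one time point and then compare a remaining factor within that subset. The sketch below shows just the subsetting step on toy objects. toy_counts and toy_meta are hypothetical, and the choice of Time 4 is only an example. It also checks that the count columns and metadata rows still line up afterwards.

```r
# Hypothetical metadata: 6 samples across two time points
toy_meta <- data.frame(
  Sample    = paste0("s", 1:6),
  Condition = c("U", "S", "U", "S", "U", "S"),
  Time      = c(1, 1, 4, 4, 4, 4),
  stringsAsFactors = FALSE
)

# Hypothetical count matrix: 5 transcripts x 6 samples
set.seed(42)
toy_counts <- matrix(rpois(30, lambda = 20), nrow = 5,
                     dimnames = list(paste0("transcript", 1:5), toy_meta$Sample))

# Keep only the samples from the chosen time point
meta_sub   <- toy_meta[toy_meta$Time == 4, ]
counts_sub <- toy_counts[, meta_sub$Sample, drop = FALSE]

# Columns of the counts must match rows of the metadata, in the same order
stopifnot(identical(colnames(counts_sub), meta_sub$Sample))

dim(counts_sub)
```

With the real data, subsetted objects like these could then be passed to DESeqDataSetFromMatrix() with a single-factor design such as ~ Condition.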