This dataset comes from the following paper by JCU researchers:
Gene expression differences between abalone that are susceptible and resilient to a simulated heat wave event
The full paper can be found at this link: Shiel et al 2020
You should choose this dataset if you want to develop your skills working with DESeq2 to perform statistical analysis of RNASeq data after the count table has been produced. As we have not explored this extensively in class, be aware that if you choose this project you may need to read more about DESeq2 and/or ask your instructor for help.
This is a complex experiment and the raw data is very large. You are provided with partially processed data as follows:
File Name | Purpose |
---|---|
count_table.tsv | A matrix of counts. Transcripts are in rows. Samples are in columns. The first column provides the transcript names. |
metadata.tsv | Metadata for all the samples (see below). |
To download and unpack these files use the following commands. You should run these commands from the top level of your RStudio project directory for the independent project assignment.
wget 'http://data.qld.edu.au/public/Q5999/JCUBioinformatics/independent-project/data_abalone.tgz' -O data.tgz
tar -zxvf data.tgz
The sample metadata has many columns, reflecting the complex nature of this experiment. The columns are:
Column Name | Meaning |
---|---|
Location | One of three location codes, E (Elliston), F (Farm Beach), S (Port Lincoln) |
Condition | Coded as U (Unsusceptible to heat stress) and S (Susceptible to heat stress) |
Tank | Numerical ID representing the tank that the abalone were kept in |
Abalone | Numerical ID representing the individual abalone |
Time | One of four time points (see below) |
A map of locations is provided in Figure 1 of Shiel et al.
The U and S codes were assigned to abalone depending on whether they survived (U) or died (S) during the post heat-stress time period. The idea is that abalone more affected by heat stress would be more likely to die, whereas those less affected would survive.
In this experiment time is also related to temperature. Time points should be interpreted as follows:
Time Number | Meaning |
---|---|
1 | February prior to transfer to tanks |
2 | 18 Degrees Celsius |
3 | 20 Degrees Celsius |
4 | 21 Degrees Celsius |
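Because Time is stored as a number, it can help to recode it as a labelled factor before plotting or modelling. The sketch below uses a small made-up data frame (`toy_meta`, not the real metadata) and takes its labels from the table above; the short label strings themselves are arbitrary choices.

```r
# Toy stand-in for metadata.tsv; the real file has more columns and rows
toy_meta <- data.frame(
  Sample = c("s1", "s2", "s3", "s4"),  # hypothetical sample names
  Time   = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Recode the numeric time points as a factor with descriptive labels
toy_meta$TimeLabel <- factor(
  toy_meta$Time,
  levels = 1:4,
  labels = c("Pre-transfer", "18C", "20C", "21C")
)

# Inspect the result
levels(toy_meta$TimeLabel)
table(toy_meta$TimeLabel)
```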
A proper statistical analysis of this data is very complex and is not actually possible with DESeq2. This is because the same abalone were measured repeatedly, which necessitates the use of a random effect term in the linear model. Another R package called limma is required to deal with this.
Instead of attempting a full analysis you should focus on the following:
The following code should help you to get started with a PCA:
library(DESeq2)
library(tidyverse)  # provides read_tsv, %>%, rownames_to_column, left_join and ggplot
counts <- read_tsv("raw_data/Abalone/count_table.tsv")
metadata <- read_tsv("raw_data/Abalone/metadata.tsv")
# Ensure that the column order in counts is the same as row order in metadata.
# Selecting by sample name also drops the first column of transcript names.
counts <- counts[,metadata$Sample]
# Imports data to DESeq2 without setting a design matrix
dds <- DESeqDataSetFromMatrix(counts,colData = metadata, design = ~ 1)
# Performs a variance stabilising transform to make data suitable for plotting with a PCA
vst <- varianceStabilizingTransformation(dds)
# Perform a PCA on the vst data
pcdata <- prcomp(assay(vst))
# Extract the rotation matrix from the PCA (one row per sample) and join it with metadata
pcdata_meta <- pcdata$rotation %>% as.data.frame() %>%
  rownames_to_column("Sample") %>%
  left_join(metadata, by = "Sample")
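The code above takes the sample coordinates from the rotation matrix of a PCA run with transcripts as rows. A common alternative is to transpose the matrix so that samples are the rows, in which case the sample coordinates are in the `x` element of the result and the proportion of variance explained by each component can be calculated from `sdev`. The sketch below illustrates this on a simulated matrix (`toy_vst`, a hypothetical stand-in for `assay(vst)`), so it runs without the abalone data.

```r
set.seed(1)
# Simulated stand-in for assay(vst): 200 transcripts (rows) x 6 samples (columns)
toy_vst <- matrix(rnorm(200 * 6, mean = 8), nrow = 200,
                  dimnames = list(paste0("tx", 1:200), paste0("s", 1:6)))

# Transpose so that samples are rows; prcomp centres each transcript by default
pc <- prcomp(t(toy_vst))

# Sample coordinates on the principal components, one row per sample
scores <- as.data.frame(pc$x)
scores$Sample <- rownames(scores)

# Percentage of variance explained by each component
pct_var <- 100 * pc$sdev^2 / sum(pc$sdev^2)
round(pct_var[1:2], 1)
```

The `scores` data frame can then be joined to the metadata by `Sample` in the same way as `pcdata_meta`.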
When visualising the data or performing statistical tests, take care because the metadata uses numerical values to encode categorical variables (e.g. Tank). You should convert these to factors or they will be treated as continuous variables.
ggplot(pcdata_meta, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = as.factor(Tank)))
In the RNASeq tutorial we saw how to use DESeq2 to analyse a very simple experimental design with one factor. We can apply this to the abalone data if we pick a factor of interest, for example Location:
dds <- DESeqDataSetFromMatrix(counts,colData = metadata, design = ~ Location)
dds_fitted <- DESeq(dds)
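Once the model is fitted, `results()` returns the per-transcript test statistics for a chosen pairwise comparison. The sketch below uses DESeq2's simulated example data from `makeExampleDESeqDataSet()` rather than the abalone counts, so the factor is called `condition` with levels `A` and `B`. For the abalone data the analogous contrast would presumably be something like `c("Location", "E", "F")`; this is an assumption based on the location codes listed above and has not been run against the real files.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset generated by DESeq2: one factor, `condition`, with levels A and B
toy_dds <- makeExampleDESeqDataSet(n = 500, m = 6)
toy_fitted <- DESeq(toy_dds)

# Log2 fold changes are reported as B relative to A
res <- results(toy_fitted, contrast = c("condition", "B", "A"))

# Transcripts ordered by adjusted p-value, smallest first
head(res[order(res$padj), ])
```

Naming the contrast explicitly matters when a factor has more than two levels, as Location does, because each call to `results()` reports a single pairwise comparison.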
To keep things simple you should probably trim down the data so that you can sensibly analyse it with one factor. If needed, you can create more complex statistical models with multiple factors, as described in the multi-factor designs section of the DESeq2 vignette:
http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs
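One way to trim the data down is to keep only the samples from a single time point, so that one factor such as Condition can be compared on its own. The sketch below uses small made-up objects (`toy_counts`, `toy_meta`) in place of the real files, and the choice of time point 4 is only an example.

```r
# Toy stand-ins for the real count table and metadata (values are made up)
toy_meta <- data.frame(
  Sample    = paste0("s", 1:6),
  Condition = c("U", "S", "U", "S", "U", "S"),
  Time      = c(1, 1, 4, 4, 4, 4),
  stringsAsFactors = FALSE
)
toy_counts <- matrix(1:18, nrow = 3,
                     dimnames = list(paste0("tx", 1:3), toy_meta$Sample))

# Keep only the samples taken at time point 4
keep <- toy_meta$Time == 4
meta_sub <- toy_meta[keep, ]
counts_sub <- toy_counts[, meta_sub$Sample]

# Columns of the counts must still line up with rows of the metadata
stopifnot(identical(colnames(counts_sub), meta_sub$Sample))
dim(counts_sub)
```

The subsetted objects can then be passed to `DESeqDataSetFromMatrix()` with a single-factor design such as `design = ~ Condition`.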