Load tidyverse packages
library(tidyverse)
Start with a very simple pipe example
x <- c(1,1)
sum(x)
Wow!, 1 + 1 = 2
Equivalently this can be accomplished with a pipe
x %>% sum()
Let’s write our own function to work with a pipe
triple <- function(x){
c(x,x,x)
}
# Try it out
triple("Hello")
Let’s try piping
# 3 Dogs
"dog" %>% triple()
Question How many dogs will I get from this line of code?
"dog" %>% triple() %>% triple()
Don’t manually count them. Pipe!
"dog" %>% triple() %>% triple() %>% length()
Here’s what this code would look like without using pipes
length(triple(triple("dog")))
The pipe is clearly much easier to read.
This time we will examine beta diversity metrics.
First download beta diversity data. We will start with the raw jaccard distance matrix
wget https://docs.qiime2.org/2023.5/data/tutorials/moving-pictures/core-metrics-results/jaccard_distance_matrix.qza
Next we use the qiime
to export the data from a qiime zip archive
(qza) format to a simple tab separated text file (tsv).
qiime tools export --input-path jaccard_distance_matrix.qza --output-path jaccard_distance_matrix
The tsv
file can be imported using read_tsv()
from the readr
package.
jac <- read_tsv("jaccard_distance_matrix/distance-matrix.tsv")
After running the import take a look at the data in RStudio. Remember that it is a distance matrix so we expect it to have an equal number of rows and columns (square matrix). The value in each cell should be the jaccard distance between the samples corresponding to that row-column pair.
This dataset is not tidy! We’re going to tidy it but first we’ll make a
quick heatmap using the non-tidyverse function heatmap
. Heatmaps are a
useful way of visualising data in matrix form.
The heatmap()
function takes a matrix
as its argument. So let’s
change our import command so that it produces a matrix
jac_dm <- read_tsv("jaccard_distance_matrix/distance-matrix.tsv") %>%
column_to_rownames("...1") %>%
as.matrix()
A few things to note:
column_to_rownames()
was needed because matrices support one data
type only. In this case numericas.matrix()
converts the tibble into a matrix at the end.heatmap(jac_dm)
Making a heatmap allows us to see that there seem to be at least three separate blocks of samples, each of which is relatively closely related to each other (darker colors).
While it was easy to make this heatmap it is (unfortunately) more difficult to extend it. What we’d like to do for example would be highlight samples according to metadata so that we could discover what is driving the three groupings.
Heatmaps are one of the few plot types not supported by ggplot
. An
excellent (though somewhat complex) package for making very high quality
heatmaps is ComplexHeatmap
available from Bioconductor. We won’t be
covering this package as it is really only useful for heatmaps and takes
quite a bit of time to learn.
A natural representation of distance matrix data is the tree based on
hierarchical clustering. The hclust
function in R can be used to
perform this operation. To use it first need to convert the distance
matrix into a special R object of type dist
.
jacc_dist <- dist(jac_dm)
jacc_clust <- hclust(jacc_dist)
The best way to plot trees in R is using a package called ggtree
. We
don’t have time to cover that package here but can easily make a simple
plot using the base-R plot
function.
plot(jacc_clust)
The explore this further we need the metadata. First download it
wget "https://data.qiime2.org/2023.5/tutorials/moving-pictures/sample_metadata.tsv" -O sample-metadata.tsv
Remember from last week’s lecture that the sample metadata could be read like this;
sample_data <- read_tsv("sample-metadata.tsv",comment = "#q") %>%
rename(sample=`sample-id`)
Now we can show the cluster labels according to body site. Notice how cumbersome this is in base-r compared with ggplot.
sample_body_sites <- sample_data$`body-site`
names(sample_body_sites) <- sample_data$sample
plot(jacc_clust,labels = sample_body_sites[jacc_clust$labels])
Another way to visualise beta diversity is using an MDS plot. We can use
the cmdscale
function in R to perform MDS based on a distance object
jacc_mds <- cmdscale(jacc_dist) %>%
as.data.frame() %>%
rownames_to_column("sample")
jacc_mds %>% ggplot(aes(V1,V2)) + geom_point()
Adding metadata is easy
jacc_mds_meta <- jacc_mds %>% left_join(sample_data)
jacc_mds_meta %>% ggplot(aes(V1,V2)) + geom_point(aes(color=`body-site`))
Now let’s convert the data to tidy format. First let’s consider why the distance matrix is “not” tidy. Consider the definition of tidy data;
The first two of these criteria are not satisfied. Each row actually
contains multiple observations. There are two variables which we will
call sample1
and sample2
. Although sample1
is in its own column
sample2
is spread across multiple columns.
jac_tidy <- read_tsv("jaccard_distance_matrix/distance-matrix.tsv") %>%
rename(sample1=`...1`) %>%
pivot_longer(-sample1,names_to = "sample2",values_to="jaccard")
This section demonstrates a range of data wrangling and plotting techniques. It focuses on getting the basic operations right but does not represent best-practice. Notice that no effort is made to label axes nicely, no captions are provided and in some cases the plots are not even particularly informative. When doing your metagenomics assignment you should not simply copy the plots in this section. For your assignment you should focus on the guidance given in lectures on scientific graphics and the instructions in your metagenomics assignment guide and report.
This section demonstrates some challenging plot types and non-standard ways of working with metagenomics data. It is entirely for your interest and will not be examined. You are not expected to know this material for exams or assignments.
After tidying the Jaccard data one potential way to plot it would be to consider the jaccard distances as defining a network. To make our network we say that;
To draw this we only want to consider connected samples. Use the
filter()
function from dplyr to select only rows corresponding to
connections
jac_tidy_connected <- jac_tidy %>% filter(jaccard < 0.8)
We can draw networks using the ggplot extension, ggraph. This is quite nice because it seems to show that we have three separate groups of samples. Samples within the groups are highly connected but there are few connections between the groups.
library(ggraph)
library(igraph)
jac_graph <- graph_from_data_frame(jac_tidy_connected)
ggraph(jac_graph) + geom_edge_link() + geom_node_point(color="red")
This provides us with metadata relating to the vertices in our graph. The ggraph package allows this to be added when constructing the graph. Now let’s plot the network again. This time colour the nodes according to the BodySite of sample1. We see that just using a simple jaccard distance threshold results in three (almost separate) sub-networks that separate gut, tongue and palm microbiomes.
jac_graph_meta <- graph_from_data_frame(jac_tidy_connected, vertices = sample_data)
ggraph(jac_graph_meta) +
geom_edge_link(aes(edge_alpha = jaccard)) +
geom_node_point(aes(color=`body-site`))
To create this visualisation we need several pieces of data
Let’s download the abundance table
wget https://docs.qiime2.org/2023.5/data/tutorials/moving-pictures/table-deblur.qza
Then export it using qiime
qiime tools export --input-path table-deblur.qza --output-path table-deblur
This time the table is exported in biom
format. We want it in tab
separated values so we need to perform an extra step
biom convert -i table-deblur/feature-table.biom -o table-deblur/feature-table.tsv --to-tsv
Now import the feature table
feature_table <- read_tsv("table-deblur/feature-table.tsv", skip = 1) %>%
rename(feature_id = `#OTU ID`)
Question: Take a look at this table. Is it tidy?
library(tidyverse)
feature_table_tidy <- feature_table %>%
pivot_longer(-feature_id,names_to = "sample_id",values_to = "count")
Next let’s get the taxonomy data.
wget https://docs.qiime2.org/2023.5/data/tutorials/moving-pictures/taxonomy.qza
qiime tools export --input-path taxonomy.qza --output-path taxonomy
Next we read in the taxonomy data
taxonomy <- read_tsv("taxonomy/taxonomy.tsv") %>%
rename(feature_id = `Feature ID`)
Now for a tricky part. We’re going to use a tidyverse function called
separate_rows
for this. Our aim is to tidy the Taxon column. This is a
tricky multi-step operation because that column encodes considerable
information.
The taxon entries look like this
k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; …
We want to split them up by the separator “;” and each time we split we
want a new row to be created. This is what the separate_rows
function
does.
taxonomy_tidy <- taxonomy %>% separate_rows(Taxon,sep = ";")
Look at the result. Now each taxonomic level is on a separate row. It’s getting tidier but isn’t completely tidy yet. Look at each entry
k__Bacteria
This says that the “Kingdom” is “Bacteria”. So we actually have two variables in a single column here. The taxonomic level is “Kingdom” and the classification itself is “Bacteria”.
This time we need to separate again, but we want to make new columns not
new rows. For this we use separate
.
We extend our original pipe to add a new operation
taxonomy_tidy <- taxonomy %>%
separate_rows(Taxon,sep = ";") %>%
separate(Taxon,into = c("level","classification"),sep = "__", fill = "right")
Some entries have a level but no classification which leads to missing values.
We just eliminate those with a filter operation.
We also need to cleanup some whitespace in the level values
taxonomy_tidy <- taxonomy %>%
separate_rows(Taxon,sep = ";") %>%
separate(Taxon,into = c("level","classification"),sep = "__") %>%
filter(classification!="") %>%
mutate(level=str_trim(level))
Now we are ready to join the taxonomy and the feature table data together
taxa_features_tidy <- taxonomy_tidy %>% left_join(feature_table_tidy)
We want to make barcharts where taxonomic composition is shown for each sample summarised at a certain taxonomic level (eg Kingdom, Phylum, etc).
Let’s try phylum
phylum_data <- taxa_features_tidy %>%
filter(level=="p") %>%
group_by(sample_id,classification) %>%
summarise(count=sum(count)) %>% na.omit()
ggplot(phylum_data) + geom_col(aes(x=sample_id,y=count,fill=classification)) +
coord_flip()