BC3203

This file forms the assessable component of your metagenomics assignment. It consists of two sections, each of which asks you to look into an aspect of the qiime tutorial in greater detail. Although the assignment uses metagenomics data the key learning outcomes are more general and relate to;

Understanding next generation sequencing (NGS)
Using a tidy data workflow to make inferences about large datasets
Effective communication using RMarkdown to combine data summaries (eg plots) alongside explanatory text.

Make sure you complete the qiime tutorial before you start this report. Example code in this report assumes that you have completely run through the qiime tutorial and that files generated by that tutorial are present with names and paths unchanged.

How to complete this report

Remove the instruction text (ie this text) and replace it with your work. You can always view the original report instructions at this link
As you make changes commit them to git and push to github. Ensure your final changes are pushed to github by the deadline
Compile the final report to pdf format using the Knit function in RStudio. Make sure your final pdf is nicely formatted and is not missing any important content. By “nicely formatted” I mean;

Use headings, bold, italics and code blocks appropriately.
Position figures appropriately in relation to the text
Provide good figure captions for every figure

The compiled version of the document should contain figures and text only. It should not contain displayed code chunks or raw code outputs (eg error messages, warnings or verbose output). Use knitr options to make sure this undesirable content is hidden in the final doc. This tutorial from the rstudio team provides a good overview of the typical knitr options that you might want to use. At a bare minimum you will want to use;

echo=FALSE to avoid displaying code chunks
warning=FALSE to avoid displaying errors

Although your code (unix commands and R code) should be hidden in the final version of your report it still forms part of your assessment. Your instructor will access your raw report.Rmd file to read this code and assess it.

If you want to describe your hidden code use a comment. For example;

``{bash, echo = FALSE}
#Unzip barcodes `fastq` and pipe this to awk to extract only sequence lines. Output is saved to `cache/barcode_seqs.txt`
gunzip -c barcodes.fastq.gz | awk 'NR % 4 == 2' > cache/barcodes.txt
``

Sometimes you will want to include code (for assessment purposes) but you don’t want that code to be run every time you compile your report. This is especially true for bash code chunks containing qiime commands which typically only need to be run once. To make report compilation faster and more reliable you can ask knitr to avoid running certain chunks with the option eval = FALSE. Take care with this option though, especially for R code since important bits of code further down in your report (eg generating a plot) might depend on code that appears near the top (eg reading a dataset).

How is code assessed in my report?

As stated above the code for your report is part of your assessment. Code will be assessed on the following criteria

That is gets the job done (does what it is supposed to do)
That it uses the tidyverse style guide for variable naming, use of whitespace etc. This applies to R code
That it is succinct. Less code is better code
That it is easy to read and understand (tidyverse style guide will help but also pay attention to variable names and logical order)

Please do not include all the code from the metagenomics tutorial (ie the guide). You need only include the code you needed to run in order to generate plots for this report. That is going to be mostly R code but might include some small bash code chunks for converting data formats etc.

Sometimes comments are useful in code, however, you should not add comments just for the sake of it. If the intent of your code is clear you do not need to add a comment.

Section 1: Mystery of the extra barcodes

After completing the section, inspecting raw data in the guide you discovered that the number of unique barcodes in raw_data/emp-single-end-sequences/barcodes.fastq.gz was much larger than the expected number of barcodes based on the number present in raw_data/sample-metadata.tsv.

Your task in this section is to explain this discrepancy. You will need to use your knowledge of data exploration techniques as well as your understanding of the limitations of sequencing technology.

Your answer should be brief (at most 1 page) and should include at least one high quality figure. Structure your answer by first explaining how many unique barcodes would be expected in barcodes.fastq.gz and why. Then explain your theory as to where the extra barcodes have come from. Back up your theory by referring to plots and/or tables made using R code.

The following instructions should help get you started with your analysis

Use bash commands to extract the barcodes data (ie raw_data/emp-single-end-sequences/barcodes.fastq.gz) into a tabular form.

mkdir cache
gunzip -c raw_data/emp-single-end-sequences/barcodes.fastq.gz | awk 'NR % 4 == 2' > cache/barcodes.txt

Use tidyverse commands to read the barcodes and their metadata (ie raw_data/sample-metadata.tsv) into R

library(tidyverse)
barcodes <- read_table("cache/barcodes.txt",col_names = c("barcode-sequence"))
metadata <- read_tsv("raw_data/sample-metadata.tsv", comment = "#")

Use tidyverse commands to join these two datasets together.

Now use some additional tidyverse commands to count the number of times each barcode occurs in the data. You should notice that some barcodes occur a large number of times, whereas others occur only rarely. Use ggplot to create a plot that clearly demonstrates this.

You should also consider the fact that some barcodes appear in the sample metadata file whereas others do not. Does this help explain the patter you observed above? Can you add this information to your original plot?

Use tidyverse commands and ggplot to explore the data a little more. See if you can find the best possible way to illustrates patterns of barcode multiplicity and barcode validity (ie is it in the metadata). There are many ways of doing this and no absolute correct answer.

At the end of this process you should have a single plot. Your plot will be marked according to the following criteria

Are the important patterns in the data clearly visible? (ie did you make smart choices about what data to plot and how?)
Is the plot neat and tidy (no overlapping text), graphical elements appropriately sized
Are the axes labelled correctly
Does the plot have a high quality caption?

In addition to your plot you should provide a single paragraph of text explaining your theory for the origin of the extra barcodes. When explaining your theory you should refer to your plot to illustrate your argument.

Hint: think about how errors in sequencing could influence occurrence of barcodes. Consider random errors vs systematic errors. Does your plot show trends typical of random or systematic errors or both? You do not need to understand exactly how these errors might arise.

How to write a caption for a plot

Plots in scientific papers are always accompanied by a caption. The caption appears immediately below the plot and has the following structure

Figure X: Short title: Descriptive text clarifying any aspects of the plot that are not immediately obvious from the plot itself.

Here X will be a number so that the figure can be referred to in the text. The short title should briefly summarise the plot and might refer to the plotting style and origin of the data. For example “Histogram showing the relationship between X and X in the XX data”. The descriptive text portion of the caption can be any length and is used to help the reader interpret the plot. For example in your barcodes plot you might use shorthand notation to label axes within the plot itself (eg X-axis - “Barcode multiplicity”) but since this is not particularly clear you could describe what “multiplicity” means in a little more detail in the caption.

The plot caption should not include any interpretation of the data (that belongs in the results).

Many plotting packages (including ggplot) have a facility to add a title to a plot. This is not appropriate for adding captions because it places the text at the top of the plot and because the title text will become part of the image (causes readability issues). Do not use this feature!. Your captions should always appear below the plot and be formatted as instructed above. Plots in scientific papers always have a caption but will never include a title above the plot.

The simplest and easiest way to add captions in RMarkdown is to add your caption as normal text below the plot. I would suggest using bold for the figure label (eg Figure X). Alternatively you may use the fig.cap option within your RMarkdown chunk. This latter facility works very well when knitting to pdf but does not work for all RMarkdown formats and can be a little tricky to use.

Section 2: Diversity Analysis

In this section you will use tidyverse functions to read and plot diversity statistics that you generated in the tutorial.

Example code is provided that demonstrates the basic idea of how make plots based on alpha and beta diversity, but the code is incomplete. Your task in this section will be to complete the code to generate plots for both alpha and beta diversity. After you make initial fixes to the code to generate a basic plot you should play around with your plot to improve it. Pay attention to the following;

Choice of variables in the data (pick those with maximum biological interest)
Choice of aesthetic mappings to show variables in the data on the plot
Sample ordering
Axis labels and legends
Caption

Both your plots in this section will be marked according to the same criteria outlined for the plot in section 1. You will end up with two plots. One for alpha diversity and another for beta diversity. For each plot you should provide a single paragraph of text describing the biological inferences that can be made from the plot. Be sure to refer to your plots in your text and make sure you interpret alpha and beta statistics appropriately.

Alpha Diversity : Demo Code

Use qiime tools export to export results for Faith’s phylogenetic diversity

qiime tools export --input-path qiime/core-metrics-results/faith_pd_vector.qza --output-path qiime/faith_pd

Since there is no visualisation for alpha diversity available via qiime we create our own in R. The following code demonstrates one way to do this.

# Load the tidyverse package which includes functions for data manipulation and plotting
library(tidyverse)

# Read the faith_pd values
faith_pd <- read_tsv("#fixme")

# Read the sample-metadata
sample_metadata <- read_tsv("#fixme")

# Join sample metadata and faith_pd data
faith_pd_annotated <- faith_pd %>% left_join(sample_metadata, by = c("fixme"))

# Joining the data is very useful as it allows us to plot values and color according to metadata
ggplot(faith_pd_annotated,aes(x=sample)) + 
  geom_col(aes(y=faith_pd, fill = `reported-antibiotic-usage`)) +
  theme(axis.text.x = element_text(angle=90))

This plot illustrates the basics of getting the data into R and plotting it. The resulting plot is not a good way to visualise this data. See if you can improve it. Things to consider are;

Have a careful look at the metadata for the experiment. You should see that there are just two subjects (people) and that multiple samples were taken over time (days-since-experiment-start) for each person.
Think about the natural ordering in this data? Think about major factors that might structure variation in the data.
Don’t get stuck on a barplot. Is there another plot style that might be more sensible?

Beta Diversity : Demo Code

The code below uses qiime tools export to export the Jaccard distance matrix. This will allow us to use tidyverse to create plots based on Jaccard distances.

qiime tools export --input-path qiime/core-metrics-results/jaccard_distance_matrix.qza --output-path qiime/jaccard

The code below shows how to read the Jaccard distance matrix we just created and perform MDS (multidimensional scaling) on it to reduce it to two dimensions. We then plot samples as points in two dimensions.

Use this code as a starting point for creating a useful plot based on a beta diversity statistic (Jaccard distance). Your plot should incorporate information from sample-metadata, and use aesthetic mappings to highlight key variables in sample-metadata that drive sample clustering based on Jaccard distances.

jaccard_dm <- read_tsv("qiime/jaccard/distance-matrix.tsv") %>% 
  column_to_rownames("X1") %>% 
  as.matrix()

# cmdscale is an R function that performs MDS analysis
# as.dist is an R function that converts a matrix of distances into an R dist object
mds <- cmdscale(as.dist(jaccard_dm), k=2) %>% 
  as.data.frame() %>% 
  rownames_to_column("id") 

ggplot(mds) + geom_point(aes(x=V1,y=V2))