BC3203

FASTA

We can download a single protein sequence

wget http://www.uniprot.org/uniprot/B2DCR8.fasta

Download all golden cuttlefish sequences from UniProt

wget -O se_proteins.fasta "https://www.uniprot.org/uniprot/?query=organism:31210&format=fasta"

Find all the definition lines

grep '>' se_proteins.fasta

Count them

grep '>' se_proteins.fasta | wc -l
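The counting step can also be done with grep alone. The sketch below is a minimal illustration on a tiny made-up FASTA file (the sequences and names are hypothetical, not taken from UniProt). It uses grep -c to count matching lines and anchors the pattern with ^ so that only lines beginning with '>' are counted.

```shell
# Build a tiny two-record FASTA file (made-up sequences, for illustration only).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 hypothetical protein A
MKTAYIAKQR
>seq2 hypothetical protein B
MSTNPKPQRK
EOF

# -c prints the number of matching lines instead of the lines themselves.
# ^ anchors the match to the start of the line, where definition lines begin.
n_records=$(grep -c '^>' "$fasta")
echo "$n_records"

# Remove the temporary file.
rm -f "$fasta"
```

On the real data the same idea would be grep -c '^>' se_proteins.fasta.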

FastQC

Step 1: Downloading

Download moving pictures data using wget

wget -O "sequences.fastq.gz" "https://data.qiime2.org/2017.12/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

Step 2: Check attributes and make read only

Look at the file we just downloaded

  1. How big is the file?

ls -l
ls -lh

  2. Is it writable?

  3. Make it read-only

chmod u-w sequences.fastq.gz
ls -l
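To see what chmod u-w changes, it can help to look only at the permission string at the start of the ls -l output. The following is a small sketch on a throwaway file (demo.txt is a hypothetical name, created in a temporary directory) rather than on the downloaded data.

```shell
# Create a throwaway file in a temporary directory.
workdir=$(mktemp -d)
printf 'example\n' > "$workdir/demo.txt"

# Remove write permission for the owner (u = user, -w = take away write).
chmod u-w "$workdir/demo.txt"

# The first 10 characters of ls -l are the file type and permission bits.
# Character 3 is the owner's write bit: 'w' if writable, '-' if not.
perms=$(ls -l "$workdir/demo.txt" | cut -c1-10)
echo "$perms"
owner_write=$(printf '%s' "$perms" | cut -c3)

# Clean up (the directory is still writable, so the file can be removed).
rm -rf "$workdir"
```

Write permission can be restored later with chmod u+w.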

Step 3: Inspect file contents

  1. Try looking at it with head. What does the extension suggest about the file?

We can uncompress with gunzip

gunzip sequences.fastq.gz

  1. How big is the file after decompression?

Now look at the top of the file

head sequences.fastq

We prefer to work directly with compressed data whenever possible. Decompression can be done “on the fly”, which saves disk space and is often faster because reading from disk can be a bottleneck.

gzip sequences.fastq
gunzip -c sequences.fastq.gz
gunzip -c sequences.fastq.gz | head
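As a sketch of what working "on the fly" can look like, the example below counts the reads in a gzipped FASTQ file and measures its uncompressed size without ever writing the uncompressed data to disk. It uses a tiny made-up file (demo.fastq is a hypothetical name) and assumes the usual FASTQ layout of exactly four lines per read.

```shell
# Build a tiny two-read FASTQ file (made-up reads) and compress it.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$workdir/demo.fastq"
gzip "$workdir/demo.fastq"   # replaces demo.fastq with demo.fastq.gz

# Stream the decompressed data into wc and count lines.
# Assumption: each read occupies exactly 4 lines.
n_lines=$(( $(gunzip -c "$workdir/demo.fastq.gz" | wc -l) ))
n_reads=$(( n_lines / 4 ))
echo "$n_reads"

# Uncompressed size in bytes, again without creating the uncompressed file.
n_bytes=$(( $(gunzip -c "$workdir/demo.fastq.gz" | wc -c) ))
echo "$n_bytes"

# Clean up.
rm -rf "$workdir"
```

The same pipelines should work on sequences.fastq.gz, although counting every line of a large file takes noticeably longer than head.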

Step 4: Run FastQC

fastqc is a program that you will install on your server during tutorial 2.

Whenever you try a new command, it’s often a good idea to start by typing just its name. Many programs will print usage information if you do this.

fastqc

Yuck!! A horrible Java stack trace.

Never mind. Try with --help

fastqc --help

Now run fastqc

fastqc sequences.fastq.gz

More FastQC Examples and MultiQC

Real experiments usually consist of many fastq files. Checking each one individually with fastqc is cumbersome, but luckily there is a great program called multiqc that will summarise the results of many fastqc reports. If you have a directory full of fastqc results you can summarise them all like this:

multiqc .

We won’t run multiqc in this tutorial but we will examine multiqc reports for a range of data types.

Download example data

wget https://bc3203.s3.ap-southeast-2.amazonaws.com/multiqc_reports.tgz
tar -zxvf multiqc_reports.tgz

This should create a folder called multiqc_reports. Examine each of the three HTML files in this folder. They represent examples of typical reports from three different sequencing types.

  1. Whole genome sequencing (dna_report.html)

  2. 16S metagenomic sequencing (16S_report.html)

  3. mRNA sequencing (mRNA_report.html)
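If you want to check what a .tgz archive contains without extracting it, tar can list the contents instead. The sketch below builds a small stand-in archive (demo_reports and its files are hypothetical names, not the course data) and lists it with the -t option.

```shell
# Build a small stand-in archive containing two placeholder HTML files.
workdir=$(mktemp -d)
mkdir "$workdir/demo_reports"
printf '<html></html>\n' > "$workdir/demo_reports/a_report.html"
printf '<html></html>\n' > "$workdir/demo_reports/b_report.html"
tar -czf "$workdir/demo_reports.tgz" -C "$workdir" demo_reports

# -t lists the contents, -z handles gzip compression, -f names the archive.
listing=$(tar -tzf "$workdir/demo_reports.tgz")
echo "$listing"

# Count how many HTML files the archive holds.
n_html=$(echo "$listing" | grep -c '\.html$')
echo "$n_html"

# Clean up.
rm -rf "$workdir"
```

For the course data the equivalent would be tar -tzf multiqc_reports.tgz.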