We can download a single protein sequence
wget http://www.uniprot.org/uniprot/B2DCR8.fasta
Download all golden cuttlefish sequences from UniProt
wget -O se_proteins.fasta "https://www.uniprot.org/uniprot/?query=organism:31210&format=fasta"
Find all the definition lines
grep '>' se_proteins.fasta
Count them
grep '>' se_proteins.fasta | wc -l
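The counting step above can also be done by grep alone. The sketch below is only an illustration: it builds a tiny stand-in file called demo.fasta with made-up records (they are not real UniProt entries), anchors the pattern to the start of the line with ^ so that a stray > elsewhere in a line is not counted, and uses grep -c in place of the pipe to wc -l.

```shell
# Build a tiny stand-in FASTA file (made-up records, not real UniProt entries)
printf '>demo1 first made-up protein\nMKTAYIAK\nQRQISFVK\n' >  demo.fasta
printf '>demo2 second made-up protein\nMSHFSRQD\n'          >> demo.fasta
printf '>demo3 third made-up protein\nMADEEKLP\n'           >> demo.fasta

# Anchor the pattern to the start of the line and let grep count matching lines
n_seqs=$(grep -c '^>' demo.fasta)
echo "Number of sequences: $n_seqs"

# List only the identifiers (the text between '>' and the first space)
ids=$(grep '^>' demo.fasta | cut -d ' ' -f 1 | tr -d '>')
echo "$ids"
```

The same two grep commands should work unchanged on se_proteins.fasta.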
Download moving pictures data using wget
wget -O "sequences.fastq.gz" "https://data.qiime2.org/2017.12/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"
Look at the file we just downloaded
ls -l
ls -lh
Is it writable?
Make it read-only
chmod u-w sequences.fastq.gz
ls -l
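To see exactly what chmod u-w changes, it can help to compare the permission string before and after. This is a small sketch on a throwaway file (perm_demo.txt is a hypothetical name, and the starting mode is set explicitly so the result does not depend on your umask). The first ten characters of each ls -l line are the file type followed by the read/write/execute flags for user, group and others.

```shell
# Create a throwaway file with a known starting mode (rw-r--r--)
echo "example" > perm_demo.txt
chmod 644 perm_demo.txt
before=$(ls -l perm_demo.txt | cut -c1-10)

# Remove write permission for the owner (u = user, -w = take away write)
chmod u-w perm_demo.txt
after=$(ls -l perm_demo.txt | cut -c1-10)

echo "before: $before"   # -rw-r--r--
echo "after:  $after"    # -r--r--r--

# Restore write permission so the file can be edited or removed normally
chmod u+w perm_demo.txt
```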
Try viewing the downloaded file with head. What does the extension suggest about the file?
We can uncompress it with gunzip
gunzip sequences.fastq.gz
Now look at the top of the file
head sequences.fastq
In practice we prefer to work directly with compressed data wherever possible. Uncompressing can be done “on the fly”, which saves disk space and is often faster because reading from disk can be a bottleneck.
gzip sequences.fastq
gunzip -c sequences.fastq.gz
gunzip -c sequences.fastq.gz | head
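As a small illustration of working on compressed data without ever writing the uncompressed file to disk, the sketch below counts the reads in a gzipped FASTQ file. It uses a made-up two-read file (demo.fastq) and assumes the usual layout of exactly four lines per FASTQ record, which holds for most modern sequencing output but is not guaranteed for every file.

```shell
# Make a tiny stand-in FASTQ file with two made-up reads, then compress it
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nIIIIIIII\n' > demo.fastq
gzip -f demo.fastq    # replaces demo.fastq with demo.fastq.gz

# Stream the uncompressed text through a pipe; nothing extra is written to disk
n_lines=$(gunzip -c demo.fastq.gz | wc -l)
n_reads=$((n_lines / 4))    # assumes four lines per record
echo "Reads: $n_reads"

# Peek at the first header line, again without uncompressing the file on disk
first_header=$(gunzip -c demo.fastq.gz | head -n 1)
echo "First header: $first_header"
```

The same pipeline should work on sequences.fastq.gz, although counting will take noticeably longer on a full-sized file.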
fastqc is a program that you will install on your server during tutorial 2.
Whenever you try a new command, it’s often a good idea to just try typing its name. Many programs will print usage information if you do this:
fastqc
Yuck!! A horrible Java stack trace.
Never mind. Try with --help
fastqc --help
Now run fastqc
fastqc sequences.fastq.gz
Real experiments usually consist of many fastq files. Checking each one individually with fastqc is quite cumbersome, but luckily there is a great program called multiqc that will summarise the results of many fastqc reports. If you have a directory full of fastqc results you can summarise them all like this:
multiqc .
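For a sense of how the two programs fit together on a multi-sample experiment, here is a rough sketch. The directory names (raw_reads, fastqc_out, multiqc_out) and the sample files are hypothetical stand-ins created by the script itself. The QC step is wrapped in a check so that the script simply skips it when fastqc or multiqc is not installed. It relies on fastqc accepting several input files at once and on the -o option of both programs for choosing an output directory.

```shell
# Hypothetical layout: one compressed FASTQ file per sample in raw_reads/
mkdir -p raw_reads fastqc_out
for s in sampleA sampleB sampleC; do
  printf '@%s_read1\nACGTACGT\n+\nIIIIIIII\n' "$s" | gzip > "raw_reads/$s.fastq.gz"
done

# Count the input files so we know what the QC step will process
n_files=0
for f in raw_reads/*.fastq.gz; do
  n_files=$((n_files + 1))
done
echo "Found $n_files FASTQ files"

# Run the QC tools only if both are available on this machine
if command -v fastqc >/dev/null 2>&1 && command -v multiqc >/dev/null 2>&1; then
  fastqc -o fastqc_out raw_reads/*.fastq.gz   # one report per input file
  multiqc -o multiqc_out fastqc_out           # one summary across all reports
else
  echo "fastqc and/or multiqc not found; skipping the QC step"
fi
```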
We won’t run multiqc in this tutorial but we will examine multiqc reports for a range of data types.
Download example data
wget https://bc3203.s3.ap-southeast-2.amazonaws.com/multiqc_reports.tgz
tar -zxvf multiqc_reports.tgz
This should create a folder called multiqc_reports. Examine each of the three html files in this folder. They represent examples of typical reports from three different sequencing types.
Whole genome sequencing (dna_report.html)
16S metagenomic sequencing (16S_report.html)
mRNA sequencing (mRNA_report.html)
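When unpacking an archive such as multiqc_reports.tgz, it can be useful to list its contents first so you know which files and folders will be created. The sketch below shows this on a small made-up archive (demo_reports.tgz, built by the script itself). The -t option lists an archive, -x extracts it, -z handles gzip compression and -f names the archive file.

```shell
# Build a small stand-in archive (made-up files, not the tutorial data)
mkdir -p demo_reports
echo "<html>demo A</html>" > demo_reports/a_report.html
echo "<html>demo B</html>" > demo_reports/b_report.html
tar -zcf demo_reports.tgz demo_reports
rm -r demo_reports

# List the contents without extracting anything
tar -ztf demo_reports.tgz

# Extract, then count the html reports that were unpacked
tar -zxf demo_reports.tgz
n_html=0
for f in demo_reports/*.html; do
  n_html=$((n_html + 1))
done
echo "Extracted $n_html html reports"
```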