In this tutorial you’ll learn how to inspect the quality of High-throughput sequencing and
perform a metagenomic assembly.
We will use data under the accession SRS018585 in the Sequence Read Archive. this sample is
“a Human Metagenome sample from G_DNA_Anterior nares of a male participant in the dbGaP study
HMP Core Microbiome Sampling Protocol A (HMP-A)”
Table of Contents
- Softwares Required for this Tutorial
- Getting the Data
- Quality Control
- Taxonomic Classification and Visualization
Softwares Required for this Tutorial
Getting the Data
wget http://downloads.hmpdacc.org/data/Illumina/anterior_nares/SRS018585.tar.bz2 tar xjf SRS018585.tar.bz2 cd SRS018585
we’ll use FastQC to check the quality of our data. FastQC can be downloaded and
ran on a Windows or LINUX computer without installation. It is available here
Start FastQC and select the fastq files you just downloaded with
file -> open
What is the average read length? The average quality?
Now we’ll trim the reads using sickle
sickle pe -f SRS018585.denovo_duplicates_marked.trimmed.1.fastq \ -r SRS018585.denovo_duplicates_marked.trimmed.2.fastq -t sanger \ -o SRS018585_trimmed_1.fastq -p SRS018585_trimmed_2.fastq -s unpaired.fastq
sickle normally gives you a summary of how many reads were trimmed.
SPAdes will be used for the assembly. Since version 3.7, SPAdes includes a metagenomic version of its algorithm, callable
with the option –meta
spades.py --meta -1 SRS018585_trimmed_1.fastq -2 SRS018585_trimmed_2.fastq -t 8 -o assembly
the resulting assenmbly can be found under assembly/scaffolds.fasta. How many contigs does this assembly contain?
How long is the longest contig and to what organism does it belong to?
Taxonomic Classification and Visualization
For the vizualisation of the assembly we will use a tool called blobtools.
Blobtools produces “Taxon annotated GC-coverage plots” (TAGC) and was orignially made for
the visualisation of (draft) genome assemblies.
mkdir blobtools && cd $_ blastn -num_threads 8 -db nt -query ../assembly/scaffolds.fasta -out blastresults.txt -outfmt '6 qseqid staxids bitscore'
This blast step is necessary to obtain the taxonomic information of your contigs.
It might take a while. Be patient!
blobtools create -i ../assembly/scaffolds.fasta -y spades -t blastresults.txt \ --nodes /export/databases/taxonomy/nodes.dmp \ --names /export/databases/taxonomy/names.dmp \ -o scaffolds --title SRS018585 blobtools plot -i scaffolds.blob.BlobDB.json -o scaffolds --title -r family
Inspect the plot, what is the most abundant families? try to play with the parameters