QUAST manual

QUAST stands for QUality ASsessment Tool. QUAST evaluates genome assemblies by computing various metrics (e.g., N50, number of ORFs). QUAST was tested on single-chromosome reference genomes and may not work well on multi-chromosome ones.
This manual will help you run the tool and understand its output. QUAST is distributed as part of the SPAdes release. See the SPAdes manual for download and installation details.
QUAST uses Plantagora and MUMmer.

1. Running QUAST

1.1 Input data

QUAST can work with assemblies and references in FASTA format. Files may be compressed with gzip.

You can also specify files with gene and operon annotations. These files should be either in GFF format (versions 2 and 3 are supported) or in plain TXT format with 4 tab-separated columns:

The coordinates are 1-based. If the start position is less than the end position, the gene or operon is on the positive strand; otherwise it is on the negative strand.
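
The strand rule above can be sketched in a few lines of Python. This is only an illustration, not QUAST's own code: the function names `strand` and `span` are hypothetical, and treating the end coordinate as inclusive is an assumption.

```python
def strand(start: int, end: int) -> str:
    """Return '+' if start < end, otherwise '-' (the rule stated above)."""
    return "+" if start < end else "-"


def span(start: int, end: int):
    """Return (leftmost, rightmost, length) of a feature.

    Assumes 1-based coordinates with an inclusive end (an assumption,
    the manual only states that coordinates are 1-based).
    """
    lo, hi = min(start, end), max(start, end)
    return lo, hi, hi - lo + 1


if __name__ == "__main__":
    print(strand(10, 50), span(10, 50))
    print(strand(50, 10), span(50, 10))
```

Both calls describe the same 41-base region; only the reported strand differs.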

You can find an example reference file and gene and operon annotations (in both TXT and GFF ver. 2 formats) in the test_dataset subdirectory of your SPAdes installation directory. These files correspond to the toy dataset distributed with SPAdes (first 1000 bp of E. coli).

1.2 Command line options

QUAST is run from the command line as follows:


    quast.py [options] <contig_file(s)>

The available options are:

-o <output_dir>

    Specify the output directory. The default value is results_<date_time>.

-R <filename>

    File with the reference genome. Most metrics can't be evaluated without a reference.

-G <filename> (or --genes <filename>)

    File with gene annotations for the given species. See section 1.1 for details about the file format.

-O <filename> (or --operons <filename>)

    File with operon annotations for the given species. See section 1.1 for details about the file format.

--min-contig <int>

    Sets the lower threshold for contig length. Only contigs of length ≥ the threshold will be taken into account (except for some metrics, see section 2). The default value is 0.

--contig-thresholds <int,int,...>

    Comma-separated list of contig length thresholds. Used in the # contigs ≥ x and Total length (≥ x) metrics (see section 2). The default value is 110,201,501,1001.

--orf <int,int,...>

    Comma-separated list of threshold lengths (in codons) of ORFs to search for. Used in the # ORFs >= x bp metrics (see section 2). The default value is 200 (i.e., 600 bp).

--not-circular

    Set this flag if the genome is NOT circular (e.g., it is a eukaryote).

-h (or --help)

    Print help.

2. Understanding the output

QUAST computes various metrics; this section describes them.
Values of the main metrics are presented in the file <QUAST_output_dir>/report.txt. There is also a tab-separated version of this file in <QUAST_output_dir>/report.tsv (suitable for Google Docs).
Plots are presented in the file <QUAST_output_dir>/plots.pdf.

Note:

2.1 Metrics descriptions

# contigs ≥ x   is the total number of contigs of length ≥ x. This metric doesn't depend on the --min-contig command line parameter (see section 1.2).

Total length (≥ x)   is the total number of bases in contigs of length ≥ x. This metric doesn't depend on the --min-contig command line parameter (see section 1.2).
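
As a rough illustration of these two metrics, the sketch below computes both for each threshold from a plain list of contig lengths. The function name `threshold_metrics` is hypothetical; the default thresholds are the ones documented for --contig-thresholds in section 1.2.

```python
def threshold_metrics(contig_lengths, thresholds=(110, 201, 501, 1001)):
    """Map each threshold x to (# contigs >= x, total length of contigs >= x)."""
    result = {}
    for x in thresholds:
        kept = [n for n in contig_lengths if n >= x]
        result[x] = (len(kept), sum(kept))
    return result


if __name__ == "__main__":
    lengths = [100, 250, 600, 1500]  # made-up contig lengths
    for x, (count, total) in threshold_metrics(lengths).items():
        print(f"# contigs >= {x}: {count}; Total length (>= {x}): {total}")
```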


All remaining metrics are computed on contigs of length ≥ x, where x is set by the --min-contig command line option (see section 1.2; the default is 0).

N50   is the contig length such that contigs of that length or longer contain half (50%) of the bases of the assembly. Usually there is no value that produces exactly 50%, so the more technical definition is the largest length x such that contigs of length at least x account for at least 50% of the total assembly length.

NG50   is the contig length such that contigs of that length or longer contain half (50%) of the bases of the reference genome. This metric can be computed only if a reference is given.

N75 and NG75   are defined similarly with 75% instead of 50%.
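
One common way to compute these values is to sort the contigs from longest to shortest and accumulate their lengths until the target fraction is reached; the length of the contig that reaches it is the answer. The sketch below follows that approach. The function name `nx` and its parameters are hypothetical, and this is not QUAST's own implementation.

```python
def nx(contig_lengths, fraction=0.5, total=None):
    """Return Nx for the given fraction (0.5 for N50, 0.75 for N75).

    `total` is the length the fraction refers to: the assembly length by
    default, or the reference length to obtain NGx. Returns None when the
    contigs never reach the target (possible only for NGx).
    """
    lengths = sorted(contig_lengths, reverse=True)
    if total is None:
        total = sum(lengths)
    target = fraction * total
    running = 0
    for n in lengths:
        running += n
        if running >= target:
            return n
    return None


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]   # made-up contig lengths, 290 bases
    print("N50 :", nx(lengths))
    print("N75 :", nx(lengths, 0.75))
    print("NG50:", nx(lengths, 0.5, total=400))   # assumed reference length
```

With a 400-base reference, NG75 is undefined for this toy assembly because 290 assembled bases cannot cover 300 reference bases.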

Number of contigs   is the total number of contigs in the assembly.

Largest contig   is the length of the longest contig in the assembly.

Total length   is the total number of bases in the assembly.

Reference length   is the total number of bases in the reference.

Average %IDY   is the average alignment identity percentage (i.e., alignment accuracy) over all contigs.

Misassemblies   is the number of positions in the assembled contigs where the left flanking sequence aligns more than 1 kb away from the right flanking sequence on the reference, or the flanking sequences overlap by more than 1 kb, or they align to different strands. This metric can be computed only if a reference is given.

Misassembled Contigs   is the number of contigs that contain misassembly events.

Misassembled Contig Bases   is the total number of bases contained in all contigs that have one or more misassemblies.

Misassembled and Unaligned   is the number of contigs that contain misassembly events but whose alignment to the reference covers less than 10% of the contig length (almost unaligned contigs).

Unaligned Contigs   is the number of contigs that have no alignment to the reference sequence. The value "X + Y part" means X totally unaligned contigs plus Y partially unaligned contigs.

Unaligned Contig Bases   is the total number of unaligned bases in totally and partially unaligned contigs (only bases from unaligned parts).

Ambiguous Contigs   is the number of contigs that have reference alignments of equal quality in multiple locations on the reference. The value "X (Y)" means X ambiguous contigs with Y bases in them.

NA50, NGA50, NA75, NGA75 (A stands for "aligned")   are like the corresponding metrics without A, but computed on the lengths of aligned blocks instead of contig lengths. That is, if a contig has a misassembly with respect to the reference, it is broken into smaller pieces. Bases that don't align to the reference are not counted in NAx and NGAx.

Mapped genome (%)   is the ratio of the total number of aligned bp in the assembly to the genome size. Short contigs (less than 500 bp) that map to multiple places may be counted many times in this quantity. A base in the genome is counted as aligned if at least one contig has at least one alignment covering this base.
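
The last sentence of this definition (a reference base counts once if any alignment covers it) can be sketched as a union of intervals. This is an illustration under stated assumptions, not QUAST's code: the function name `mapped_genome_percent` is hypothetical, and alignments are assumed to be given as 1-based inclusive (start, end) pairs on the reference with start ≤ end.

```python
def mapped_genome_percent(alignments, reference_length):
    """Percentage of reference bases covered by at least one alignment.

    `alignments` is an iterable of (start, end) pairs, 1-based inclusive,
    with start <= end (an assumed representation).
    """
    covered = 0
    current_end = 0  # rightmost reference position counted so far
    for start, end in sorted(alignments):
        # Skip the part of this alignment that was already counted.
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / reference_length


if __name__ == "__main__":
    hits = [(50, 150), (1, 100), (300, 400), (20, 30)]  # made-up alignments
    print(f"Mapped genome: {mapped_genome_percent(hits, 1000):.1f}%")
```

Overlapping and nested alignments contribute each reference base only once, so the toy input covers 251 of 1000 bases.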

Genes   is the number of genes in the assembly (full and partial), based on the positions the contigs map to in the reference genome with an annotated list of genes. The value "X + Y part" means X full genes and Y partial ones (at least 100 bases are aligned). This metric can be computed only if the reference and a gene annotations file are given (see section 1.2).

Operons   is defined similarly to the previous metric, but requires an operon annotations file.

# ORFs >= x bp   is the total number of Open Reading Frames of length ≥ x bp found in the assembly (the length of one codon is 3 bp). This metric is computed if a gene annotations file is not given.
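
A minimal sketch of the "X + Y part" counting for genes or operons is shown below. It makes assumptions the manual does not state: a feature is treated as full when a single alignment covers it entirely, and as partial when the best single alignment covers at least 100 of its bases. The function name `count_features` and the (start, end) representation are hypothetical.

```python
def count_features(features, alignments, min_partial=100):
    """Return (full, partial) counts of annotated features.

    `features` are (start, end) pairs in either orientation (1-based,
    inclusive); `alignments` are (start, end) pairs on the reference with
    start <= end. Both representations are assumptions for this sketch.
    """
    full = partial = 0
    for f_start, f_end in features:
        lo, hi = min(f_start, f_end), max(f_start, f_end)
        best = 0  # largest overlap with any single alignment
        for a_start, a_end in alignments:
            overlap = min(hi, a_end) - max(lo, a_start) + 1
            best = max(best, overlap)
        if best == hi - lo + 1:
            full += 1
        elif best >= min_partial:
            partial += 1
    return full, partial


if __name__ == "__main__":
    genes = [(100, 400), (900, 500), (2000, 2300)]   # made-up annotations
    hits = [(1, 600), (1900, 2050)]                  # made-up alignments
    full, part = count_features(genes, hits)
    print(f"Genes: {full} + {part} part")
```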
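
To make the codon and bp thresholds concrete, here is a simple ORF counter. The manual does not describe how QUAST detects ORFs, so everything here is an assumption: an ORF is taken to run from an ATG to the next in-frame stop codon, both strands and all three frames are scanned, and the stop codon is included in the length. The names `orf_lengths` and `count_orfs` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orf_lengths(seq):
    """Yield lengths in bp (stop codon included) of ATG..stop ORFs.

    Scans three frames on the given strand and on its reverse complement.
    This is an assumed ORF definition, not necessarily the one QUAST uses.
    """
    seq = seq.upper()
    for strand_seq in (seq, seq.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    yield i + 3 - start
                    start = None


def count_orfs(seq, min_codons=200):
    """Count ORFs of at least `min_codons` codons (200 codons = 600 bp)."""
    return sum(1 for n in orf_lengths(seq) if n >= min_codons * 3)


if __name__ == "__main__":
    toy = "ATG" + "AAA" * 5 + "TAA"   # a single 7-codon ORF
    print(count_orfs(toy, min_codons=7), count_orfs(toy, min_codons=8))
```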

2.2 Plots descriptions

Cumulative length

    This plot shows the growth rate of assembly lengths. The y-axis is the size of the assembly and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one).
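
The points of such a curve can be sketched as running sums over the contig lengths sorted from largest to smallest. The function name `cumulative_lengths` is hypothetical, and this shows only how the data could be derived, not how QUAST draws the plot.

```python
from itertools import accumulate


def cumulative_lengths(contig_lengths):
    """Return y-values of the curve: the i-th value is the total length
    of the i largest contigs (x = 1, 2, ..., number of contigs)."""
    return list(accumulate(sorted(contig_lengths, reverse=True)))


if __name__ == "__main__":
    for x, y in enumerate(cumulative_lengths([30, 80, 50]), start=1):
        print(x, y)
```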

Nx

    This plot shows how the Nx metric value changes as x varies. The y-axis is the Nx value and the x-axis is the x value.

NGx

    This plot is similar to the previous one, but for the NGx metric.

Cumulative length (aligned contigs)

    This plot is similar to the Cumulative length plot, but it shows aligned blocks instead of the contigs themselves. If a contig has a misassembly with respect to the reference, it is broken into smaller pieces. This plot can be created only if a reference is given.

NAx and NGAx

    These plots are similar to the Nx and NGx plots, but for the NAx and NGAx metrics respectively. These plots can be created only if a reference is given.

Genes

    This plot shows the growth rate of full genes in assemblies. The y-axis is the number of full genes in the assembly and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one). This plot can be created only if the reference and a gene annotations file are given.

Operons

    This plot is similar to the previous one, but for operons.

3. Feedback and bug reports

We will be thankful if you help us make QUAST better by sending your comments, bug reports, and suggestions to quast.support@bioinf.spbau.ru.

If you have trouble running QUAST, please attach the log <QUAST_output_dir>/quast.log or an archive of the whole <QUAST_output_dir>.
Note that if you didn't specify <QUAST_output_dir> manually, it was created with the name results_<date_time>, and a symbolic link named latest pointing to that directory was created.