QUAST stands for QUality ASsessment Tool.
The tool evaluates genome assemblies by computing various metrics.
You can find all project news and the latest version of the tool at http://sourceforge.net/projects/quast.
QUAST utilizes MUMmer, GeneMarkS, GeneMark-ES, GlimmerHMM, and GAGE. In addition, MetaQUAST uses MetaGeneMark, BLAST, and the SILVA 16S rRNA database. These tools are built into the QUAST package, which is ready for use by academic, non-profit institutions and U.S. Government agencies. If you are not in one of these categories, please refer to the LICENSE section 'Third-party tools incorporated into QUAST' for guidelines on how to complete the licensing process.
Version 3.0 of QUAST was released under GPL v2 on XX July 2015. Note that some of the built-in third-party tools are not under GPL v2. See LICENSE for details.
QUAST can be run on Linux or Mac OS.
ar, so you will have to install Xcode (or only Command Line Tools for Xcode) to make them available.
pip install matplotlib
Or with the Easy Install Python module:
easy_install matplotlib
Or on Ubuntu by typing:
sudo apt-get install python-matplotlib
wget https://downloads.sourceforge.net/project/quast/quast-3.0.tar.gz
tar -xzf quast-3.0.tar.gz
cd quast-3.0
QUAST automatically compiles all its sub-parts when they are first needed, so there is no special installation command for QUAST. However, we recommend that you run:
python quast.py --test (if you plan to use quast.py)
or/and
python metaquast.py --test (if you plan to use metaquast.py with references)
or/and
python metaquast.py --test-no-ref (if you plan to use metaquast.py without references)
These commands run all QUAST and MetaQUAST modules and check that they work correctly on your platform.
./install.sh script, which runs all three commands mentioned above.
Note: you should place the quast-3.0 directory in its final destination before the first use (e.g. before running with --test). If you want to move QUAST to a new location after it has been used, start from a clean copy of quast-3.0. This limitation is caused by the absolute paths that are auto-generated in the compiled modules of QUAST.
./quast.py test_data/contigs_1.fasta \
           test_data/contigs_2.fasta \
        -R test_data/reference.fasta.gz \
        -G test_data/genes.txt \
        -O test_data/operons.txt \
View the summary of the evaluation results with the less utility:
The test_data directory contains examples of assembly, reference, gene and operon files.
The tool accepts assemblies and references in FASTA format. Files may be compressed with zip, gzip, or bzip2.
Multiple reference chromosomes can be provided as separate sequences in a single FASTA file.
Maximum assembly length is 4.29 Gbp.
Maximum length of a reference sequence (e.g. a chromosome) is 536 Mbp. The number of sequences in a reference file is not limited.
These restrictions are imposed by Nucmer, the tool that QUAST uses to align contigs to a reference genome. The metrics that do not require alignment are computed in any case.
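As a rough pre-flight check, the limits above can be tested before running the tool. The following is a minimal Python 3 sketch and not part of QUAST: the function names are hypothetical, the exact base counts used for 4.29 Gbp and 536 Mbp are assumptions, and zip archives are not handled.

```python
import bz2
import gzip

# Limits quoted in the text; the exact base counts are assumptions.
MAX_ASSEMBLY_BP = 4_290_000_000
MAX_REFERENCE_SEQ_BP = 536_000_000


def open_fasta(path):
    """Open a plain, gzip- or bzip2-compressed FASTA file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "rt")


def sequence_lengths(lines):
    """Return {sequence_name: length} for an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_limits(lengths, is_reference):
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    if is_reference:
        # The reference limit applies to each sequence separately.
        for name, length in lengths.items():
            if length > MAX_REFERENCE_SEQ_BP:
                problems.append(f"{name}: {length} bp exceeds the reference limit")
    elif sum(lengths.values()) > MAX_ASSEMBLY_BP:
        # The assembly limit applies to the total length.
        problems.append("total assembly length exceeds the limit")
    return problems
```

A typical call would be `check_limits(sequence_lengths(open_fasta(path)), is_reference=True)` for a reference file.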
Genes and operons
One can also specify files with gene and operon positions in the reference. QUAST will count fully and partially aligned regions, and output total values and cumulative plots.
The following file formats are supported:
GAGE is a well-known assessment tool. However, it has limitations:
--gage option). QUAST filters contigs according to a specified threshold and runs GAGE on every assembly. GAGE statistics (see the GAGE site and the GAGE paper for descriptions) are reported in addition to a standard QUAST report.
python quast.py [options] <contig_file(s)>
Options:
The metaquast.py script accepts multiple references. You can provide several files or a directory containing multiple reference files.
The tool partitions all contigs into groups aligned to each reference. Note that a contig may belong to several groups simultaneously if it aligns to several references.
MetaQUAST runs quast.py for each of the following:
If you run MetaQUAST without references, the tool will attempt to identify which organisms are present in the metagenome.
MetaQUAST uses BLASTN to align contigs to a FASTA file containing small subunit ribosomal RNA sequences (the SILVA rRNA database).
For each assembly, the 30 reference genomes with the highest scores are chosen (by default). The maximum number of references to download can be specified with the option
Reference genomes for the chosen organisms are downloaded from NCBI database to
After the first run of quast.py, MetaQUAST removes reference genomes with a low genome fraction (less than 10%) and runs quast.py again with the remaining references.
Notably, for all references in combination, MetaQUAST runs quast.py with the option
python metaquast.py contigs_1 contigs_2 ... -R reference_1,reference_2,reference_3,...
All options are the same as for quast.py, except for -R: it can accept multiple references (comma-separated, without spaces in between) or a directory with references.
If an output path was not specified manually, QUAST puts its output into the directory
quast_results/results_<date_time> and creates a symlink
latest to it inside the directory
QUAST output contains:
| report.txt | an assessment summary in a simple text format, |
| report.tsv | a tab-separated version of the summary, suitable for spreadsheets (Google Docs, Excel, etc), |
| report.tex | a LaTeX version of the summary, |
| alignment.svg | a contig alignment plot (file is created if the matplotlib python library is installed), |
| report.pdf | all other plots combined with all tables (file is created if the matplotlib python library is installed), |
| report.html | an HTML version of the report with interactive plots inside it, |
| misassemblies_report | a detailed report on misassemblies. See section 3.1.2 for details, |
| unaligned_report | a detailed report on unaligned and partially unaligned contigs. See section 3.1.3 for details. |
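The tab-separated report is convenient for scripting. A minimal sketch of loading it follows. It assumes (this is not stated above) that the first column holds metric names and each further column holds one assembly, with assembly names in the first row. The function name and the sample data are hypothetical.

```python
import csv
import io


def read_report_tsv(handle):
    """Parse a tab-separated report into {assembly: {metric: value}}."""
    rows = list(csv.reader(handle, delimiter="\t"))
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    assemblies = header[1:]  # assumed layout: one column per assembly
    report = {name: {} for name in assemblies}
    for row in body:
        if not row:
            continue
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            report[name][metric] = value  # values are kept as strings
    return report


# Hypothetical two-assembly report, used only to show the expected shape.
example = "Assembly\tcontigs_1\tcontigs_2\n# contigs\t12\t15\nN50\t1000\t800\n"
parsed = read_report_tsv(io.StringIO(example))
print(parsed["contigs_1"]["N50"])
```

With a real run, the handle would come from something like `open("quast_results/latest/report.tsv")`.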
# contigs (≥ x bp) is the total number of contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).
Total length (≥ x bp) is the total number of bases in contigs of length ≥ x bp. Not affected by the --min-contig parameter (see section 2.4).
All remaining metrics are computed only for the contigs that exceed the threshold specified by the --min-contig option (see section 2.4, default is 500).
# contigs is the total number of contigs in the assembly.
Total length is the total number of bases in the assembly.
Reference length is the total number of bases in the reference.
Reference GC (%) is the percentage of G and C nucleotides in the reference.
N75 and NG75 are defined similarly with 75 % instead of 50 %.
L50 (L75, LG50, LG75) is the number of contigs that are at least as long as N50 (N75, NG50, NG75).
In other words, L50, for example, is the minimal number of contigs that cover half the assembly.
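To make these metrics concrete, here is a minimal sketch using the common definition of N50: the contig length such that contigs of that length or longer cover at least half of the assembly. The function name is hypothetical. Using the reference length as the total for the NG/LG variants is the usual convention and is assumed here rather than taken from QUAST's code.

```python
def nx_lx(contig_lengths, fraction=0.5, total=None):
    """Return (Nx, Lx) for the given fraction.

    total defaults to the assembly length (N50/L50); pass the reference
    length instead to get the NG/LG variants.
    """
    if total is None:
        total = sum(contig_lengths)
    threshold = total * fraction
    covered = 0
    # Walk contigs from the largest to the smallest, accumulating length.
    for count, length in enumerate(sorted(contig_lengths, reverse=True), 1):
        covered += length
        if covered >= threshold:
            return length, count
    return None, None  # the contigs never reach the requested fraction


lengths = [80, 70, 50, 40, 30, 20]          # total = 290
print(nx_lx(lengths))                        # N50, L50
print(nx_lx(lengths, fraction=0.75))         # N75, L75
print(nx_lx(lengths, total=1000))            # NG50, LG50 for a 1000 bp reference
```

For this toy input the two largest contigs (80 + 70 = 150) already cover half of 290, so N50 is 70 and L50 is 2. With a 1000 bp reference the contigs never reach 500 bp, so the NG50 variant is undefined.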
# misassemblies is the number of positions in the contigs that satisfy one of the following criteria:
# misassembled contigs is the number of contigs that contain misassembly events.
Misassembled contigs length is the total number of bases in misassembled contigs.
# local misassemblies is the number of breakpoints that satisfy the following conditions:
# unaligned contigs is the number of contigs that have no alignment to the reference sequence. The value "X + Y part" means X totally unaligned contigs plus Y partially unaligned contigs.
Unaligned length is the total length of all unaligned regions in the assembly (sum of lengths of fully unaligned contigs and unaligned parts of partially unaligned ones).
Genome fraction (%) is the percentage of aligned bases in the reference. A base in the reference is aligned if there is at least one contig with at least one alignment to this base. Contigs from repetitive regions may map to multiple places, and thus may be counted multiple times.
Duplication ratio is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference (see Genome fraction (%) for the 'aligned base' definition). If the assembly contains many contigs that cover the same regions of the reference, its duplication ratio may be much larger than 1. This may occur due to overestimating repeat multiplicities and due to small overlaps between contigs, among other reasons.
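Both definitions can be illustrated with a small sketch. It is a simplification, not QUAST's implementation: alignments are given as closed (start, end) intervals on a single reference sequence, and the number of aligned assembly bases is approximated by the length of each interval on the reference (indels are ignored). All names are hypothetical.

```python
def covered_bases(intervals):
    """Count reference positions covered by at least one closed interval."""
    covered = 0
    current_end = 0  # last position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip what is already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered


def genome_fraction(intervals, reference_length):
    """Percentage of reference bases with at least one alignment."""
    return 100.0 * covered_bases(intervals) / reference_length


def duplication_ratio(intervals):
    """Aligned assembly bases divided by aligned reference bases."""
    aligned_in_assembly = sum(end - start + 1 for start, end in intervals)
    return aligned_in_assembly / covered_bases(intervals)


# Three 100 bp alignments, two of which overlap by 50 bp.
alignments = [(1, 100), (51, 150), (301, 400)]
print(genome_fraction(alignments, reference_length=1000))
print(duplication_ratio(alignments))
```

Here 250 reference bases are covered by 300 aligned assembly bases, so the genome fraction is 25 % and the duplication ratio is 1.2.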
# N's per 100 kbp is the average number of uncalled bases (N's) per 100000 assembly bases.
# mismatches per 100 kbp is the average number of mismatches per 100000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.
# indels per 100 kbp is the average number of indels per 100000 aligned bases. Several consecutive single nucleotide indels are counted as one indel.
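The per-100-kbp normalisation and the merging of consecutive single-nucleotide indels can be sketched as follows. The function names are hypothetical, and representing indels as a list of positions in which adjacent indel bases have adjacent coordinates is an assumption made for illustration.

```python
def per_100_kbp(count, total_bases):
    """Scale an event count to events per 100000 bases."""
    return 100000.0 * count / total_bases


def count_indel_events(positions):
    """Count runs of consecutive positions as single indel events."""
    events = 0
    previous = None
    for pos in sorted(positions):
        if previous is None or pos != previous + 1:
            events += 1  # a new run starts here
        previous = pos
    return events


# Positions 10-12 form one event, 40 another, 100-101 a third.
events = count_indel_events([10, 11, 12, 40, 100, 101])
print(events, per_100_kbp(events, total_bases=200000))
```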
# genes is the number of genes in the assembly (complete and partial), based on a user-provided list of gene positions in the reference. A gene is 'partially covered' if the assembly contains at least 100 bp of this gene but not the whole gene.
This metric is computed only if a reference genome and an annotated list of gene positions are provided (see section 2.4).
# operons is defined similarly to # genes, but an operon positions file is required instead.
Largest alignment is the length of the largest continuous alignment in the assembly. This value can be smaller than the length of the largest contig if the largest contig is misassembled.
NA50, NGA50, NA75, NGA75, LA50, LA75, LGA50, LGA75 ("A" stands for "aligned") are similar to the corresponding metrics without "A", but in this case aligned blocks instead of contigs are considered.
Aligned blocks are obtained by breaking contigs at misassembly events and removing all unaligned bases.
# misassemblies is the same as # misassemblies from section 3.1.1. However, this report also contains a classification of all misassemblies into three groups: relocations, translocations, and inversions (see below). For metagenomic assemblies, this classification also includes interspecies translocation.
Relocation is a misassembly where the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference, or they overlap by more than 1 kbp, and both flanking sequences align on the same chromosome. Note that the default threshold of 1 kbp can be
Translocation is a misassembly where the flanking sequences align on different chromosomes.
Interspecies translocation is a misassembly where the flanking sequences align on different references (MetaQUAST only).
Inversion is a misassembly where the flanking sequences align on opposite strands of the same chromosome.
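The four definitions above can be turned into a small decision procedure. This is an illustrative sketch only: the `Flank` type and `classify` function are hypothetical, the order in which the rules are checked is an assumption, and only reference coordinates are considered, whereas the real tool may take more into account.

```python
from dataclasses import dataclass


@dataclass
class Flank:
    reference: str   # reference genome name (matters for MetaQUAST only)
    chromosome: str  # sequence name within that reference
    start: int       # leftmost reference position of the alignment
    end: int         # rightmost reference position of the alignment
    strand: str      # "+" or "-"


def classify(left, right, threshold=1000):
    """Return the misassembly type for two flanking alignments, or None."""
    if left.reference != right.reference:
        return "interspecies translocation"
    if left.chromosome != right.chromosome:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Positive gap: flanks are apart; negative gap: flanks overlap.
    gap = right.start - left.end - 1
    if abs(gap) > threshold:
        return "relocation"
    return None  # consistent within the threshold


left = Flank("ecoli", "chr1", 1, 1000, "+")
print(classify(left, Flank("ecoli", "chr1", 5000, 6000, "+")))
print(classify(left, Flank("ecoli", "chr1", 1100, 2000, "-")))
```

The `threshold` argument mirrors the default of 1 kbp mentioned in the relocation definition.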
# misassembled contigs and misassembled contigs length are the same as the metrics from section 3.1.1 and are counted among all contigs with any type of a misassembly (relocation, translocation, interspecies translocation or inversion).
# possibly misassembled contigs is the number of contigs that contain a large unaligned fragment and thus could possibly contain an interspecies translocation with an unknown reference (MetaQUAST only).
# local misassemblies is the same as # local misassemblies from section 3.1.1.
# mismatches is the number of mismatches in all aligned bases.
# indels is the number of indels in all aligned bases.
# short indels (≤ 5 bp) is the number of indels of length ≤ 5 bp.
# long indels (> 5 bp) is the number of indels of length > 5 bp.
Indels length is the total number of bases contained in all indels.
Note: Nucmer's default maximum indel length is 85 bp. All indels longer than 85 bp are considered local misassemblies.
# fully unaligned contigs is the number of contigs that have no alignment to the reference sequence.
Fully unaligned length is the total number of bases in all unaligned contigs.
# partially unaligned contigs is the number of contigs that are not fully unaligned, but have fragments with no alignment to the reference sequence.
# with misassembly is the number of partially unaligned contigs that have a misassembly in their aligned fragment. Note that such misassemblies are not counted in # misassemblies and other misassemblies statistics.
# both parts are significant is the number of partially unaligned contigs that have both aligned and unaligned fragments
longer than the value of
Partially unaligned length is the total number of unaligned bases in all partially unaligned contigs.
# N's is the total number of uncalled bases (N's) in the assembly.
Contig alignment plot shows alignment of contigs to the reference genome and the positions of misassemblies in these contigs. Contigs that align correctly are colored blue if the boundaries agree (within 2 kbp on each side, for contigs larger than 10 kbp) in at least half of the assemblies, and green otherwise. Blocks of misassembled contigs are colored orange if the boundaries agree in at least half of the assemblies, and red otherwise. Contigs are staggered vertically and are shown in different shades of their color in order to distinguish the separate contigs, including small ones. If the reference file consists of several sequences, all of them are drawn on a single plot horizontally next to each other.
Cumulative length plot shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of the x largest contigs in the assembly.
Nx plot shows Nx values as x varies from 0 to 100 %.
NGx plot shows NGx values as x varies from 0 to 100 %.
GC content plot shows the distribution of GC content in the contigs.
The x value is the GC percentage (0 to 100 %).
The y value is the number of non-overlapping 100 bp windows whose GC content equals x %.
For a single genome, the distribution is typically Gaussian. However, for assemblies with contaminants, the GC distribution appears to be a superposition of Gaussian distributions, giving a plot with multiple peaks.
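The histogram behind this plot can be sketched as follows. Dropping a trailing window shorter than 100 bp, rounding to whole percents, and counting N's as non-GC are assumptions of this sketch, not documented QUAST behaviour. The function name is hypothetical.

```python
from collections import Counter


def gc_window_histogram(contigs, window=100):
    """Map GC percentage (0..100) to the number of windows with that value."""
    histogram = Counter()
    for sequence in contigs:
        sequence = sequence.upper()
        # Non-overlapping windows; a trailing partial window is dropped.
        for start in range(0, len(sequence) - window + 1, window):
            chunk = sequence[start:start + window]
            gc = chunk.count("G") + chunk.count("C")
            histogram[round(100 * gc / window)] += 1
    return histogram


# One all-GC window, one all-AT window, and 30 trailing bases that are ignored.
contig = "GC" * 50 + "AT" * 50 + "G" * 30
print(dict(gc_window_histogram([contig])))
```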
Cumulative length plot for aligned contigs shows the growth of lengths of aligned blocks.
If a contig has a misassembly, QUAST breaks it into smaller pieces called aligned blocks.
On the x-axis, blocks are ordered from the largest to smallest. The y-axis gives the size of the x largest aligned blocks.
This plot is created only if a reference genome is provided.
NAx and NGAx plots
These plots are similar to the Nx and NGx plots but for the NAx and NGAx metrics respectively. These plots are created only if a reference genome is provided.
Genes plot shows the growth of the number of full genes in assemblies.
The y-axis is the number of full genes in the assembly, and the x-axis is the number of contigs in the assembly (from the largest one to the smallest one).
This plot is created only if a reference genome and a gene annotation file are provided.
Operons plot is similar to the previous one but for operons.
All outputs are in separate directories inside the directory provided by
-o (or in quast_results/latest).
Also, plots and reports for each metric as well as combined HTML report are saved in
These plots are created for each metric to show its values for all assemblies vs all references. References on the plot are sorted by the mean value of this metric in all assemblies. References are always sorted from the best results to the worst ones, so the plot can be descending or ascending depending on the metric.
Summary text reports
These files contain the same information as the summary plots, but in text format.
The summary HTML report is based on the HTML report in combined_quast_output/. Each row is expandable and contains data for all references.
You can view results separately for each reference by clicking on a row preceded by a plus sign:
Note that values for some metrics like # contigs may not sum up, because one contig may be aligned to multiple references.
You can easily change the content, order of metrics, and metric names in all QUAST reports. To do this, edit the CONFIGURABLE PARAMETERS section in libs/reporting.py. It contains many informative comments, which will help you adjust QUAST reports even if you are new to Python.
You can also adjust plot colors, style and width of lines, legend font, plots output format, etc.
Please see the
CONFIGURABLE PARAMETERS section in
Note: if you restart QUAST on the same directory with new parameters, it will reuse alignments and run much faster.
See the description of the
-o option in section 2.4.
We will be thankful if you help us make QUAST better by sending your comments, bug reports, and suggestions to email@example.com.
We kindly ask you to attach the quast.log file from the output directory (or an archive of the entire folder) if you have trouble running QUAST.
Note that if you didn't specify the output directory manually, it is going to be automatically set to
quast_results/results_<date_time>, with a symbolic link
quast_results/latest to that directory.
This section contains the most popular questions about QUAST output. Read the answers for a deeper understanding of the results generated by the tool.
In several answers there are descriptions of files under
If you use the command-line version of QUAST, <quast_output_dir> is the directory specified with the -o option, or "quast_results/latest" by default.
If you use http://quast.bioinf.spbau.ru/ you should download the full report by pressing the "Download report" button (at the top-right corner),
decompress result and go to
Q1. It seems that QUAST is giving me a differing number of misassemblies and misassembled contigs. Does this imply that QUAST looks for multiple misassemblies within one contig?
Yes, you are right, QUAST looks for multiple misassemblies within one contig. Thus, the number of misassembled contigs is always less than or equal to the number of misassemblies.
Q2. Is there a way to get only misassembled contigs of the assembly?
Yes, there is.
QUAST copies all misassembled contigs of
"<assembly_name>" assembly into
E.g. if your assembly is "contigs.fasta" then the file is "contigs.mis_contigs.fa", if your assembly is "ecoli_assembly_1.fasta" then the file is "ecoli_assembly_1.mis_contigs.fa".
Q3. Is it possible to find which misassembly corresponds to each contig and which kind of a misassembly it is?
Yes, it is possible. QUAST produces a report with detailed information about the alignments of each contig, and a short version containing only the extensive misassembly records.
Let's start with the short one. It is saved to
<quast_output_dir>/contigs_reports/contigs_report_<assembly_name>.mis_contigs.info. E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.mis_contigs.info"; if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.mis_contigs.info".
The content of this file looks like this:
Extensive misassembly ( inversion ) between 287 575 and 296 1
Extensive misassembly ( relocation, inconsistency = 2655 ) between 16800 18907 and 18905 20382
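Records like the two above are easy to extract with a regular expression. The sketch below is built only from these two sample lines, so the pattern may not cover every variant the tool writes. The four numbers appear to be the contig coordinates of the two flanking alignments (compare with the detailed report example for NODE_772), but that reading is an assumption. All names are hypothetical.

```python
import re

# Pattern derived from the two sample records; other variants may exist.
PATTERN = re.compile(
    r"Extensive misassembly \(\s*(?P<kind>[a-z ]+?)"
    r"(?:,\s*inconsistency\s*=\s*(?P<inconsistency>-?\d+))?\s*\)"
    r"\s*between\s+(?P<a1>\d+)\s+(?P<a2>\d+)\s+and\s+(?P<b1>\d+)\s+(?P<b2>\d+)"
)


def parse_misassembly(line):
    """Return a dict describing one misassembly record, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    inconsistency = match.group("inconsistency")
    return {
        "kind": match.group("kind").strip(),
        "inconsistency": int(inconsistency) if inconsistency else None,
        "first": (int(match.group("a1")), int(match.group("a2"))),
        "second": (int(match.group("b1")), int(match.group("b2"))),
    }


record = parse_misassembly(
    "Extensive misassembly ( relocation, inconsistency = 2655 ) "
    "between 16800 18907 and 18905 20382"
)
print(record["kind"], record["inconsistency"])
```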
Let's move to the detailed report. Here you can find information about all misassembled, unaligned and properly aligned contigs. This report is saved to
<quast_output_dir>/contigs_reports/contigs_report_<assembly_name>.stdout file. E.g. if your assembly is "contigs.fasta" then the file is "contigs_report_contigs.stdout"; if your assembly is "ecoli_assembly_1.fasta" then the file is "contigs_report_ecoli_assembly_1.stdout".
To get information about misassemblies, search the report for the words "Extensive misassembly" and look at the surrounding lines to find the name of the contig that contains this misassembly.
Look at the following example:
CONTIG: NODE_772 (575bp)
Top Length: 296 Top ID: 100.0
Skipping redundant alignment 1096745 1096882 | 138 1 | 138 138 | 98.55 | Escherichia_coli NODE_772
This contig is misassembled. 3 total aligns.
Real Alignment 1: 924846 925134 | 287 575 | 289 289 | 100.0 | Escherichia_coli NODE_772
Extensive misassembly ( inversion ) between these two alignments
Real Alignment 2: 924906 925201 | 296 1 | 296 296 | 100.0 | Escherichia_coli NODE_772
Here is another example:
CONTIG: Contig_753 (140518bp)
Top Length: 121089 Top ID: 99.98
Skipping redundant alignments after choosing the best set of alignments
Skipping redundant alignment 273398 273468 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
Skipping redundant alignment 3363797 3363867 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
This contig is misassembled. 14 total aligns.
Real Alignment 1: 1425621 1426074 | 19431 18978 | 454 454 | 100.0 | Escherichia_coli Contig_753
Gap between these two alignments (local misassembly). Inconsistency = 148
Real Alignment 2: 1426295 1426818 | 18905 18382 | 524 524 | 100.0 | Escherichia_coli Contig_753
Extensive misassembly ( relocation, inconsistency = 2224055 ) between these two alignments
Real Alignment 3: 3650278 3650348 | 18977 18907 | 71 71 | 100.0 | Escherichia_coli Contig_753
Extensive misassembly ( relocation, inconsistency = 236807 ) between these two alignments
Real Alignment 4: 3765544 3886652 | 140518 19430 | 121109 121089 | 99.98 | Escherichia_coli Contig_753
Extensive misassembly ( relocation, inconsistency = -1052 ) between these two alignments
Real Alignment 5: 3886649 3905037 | 18381 1 | 18389 18381 | 99.96 | Escherichia_coli Contig_753
Q4. Could you explain the format of Real Alignments in contigs report files (see the answer for Q3)?
Yes, sure. Let's look at the following example:
Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | ENA|U00096|U00096.2_Escherichia_coli contig-710
The next two numbers (in this case: 718 718) mean "the number of aligned bases on the target" and "the number of aligned bases on the query". They are usually equal to each other but they can be slightly different because of short insertions and deletions. Strictly speaking, these numbers are redundant because they can be easily calculated from the first two pairs of numbers (positions on the target and positions on the query). However, sometimes it is convenient to look at these numbers.
The last number (in this case: 100.0) is the Nucmer aligner quality metric. It is called "identity %" (IDY %) and it describes the quality of the alignment (the number of mismatches and indels between the target and the query). If IDY% = 100.0 then the alignment is perfect, i.e. all bases on the target and on the query are equal to each other. If IDY% is less than 100.0 then the target and the query are slightly different. QUAST has a threshold on IDY% of 95%, so alignments with IDY% less than 95% are not used (they are relatively poor).
And finally, the last two columns are the name of the target sequence (i.e. reference name) and the name of the query (i.e. contig name).
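The column layout described in this answer can be parsed with a few lines of Python. This is a sketch based on the single example line above: the function and key names are hypothetical. Because a reference name may itself contain "|" characters (as in the example), the line is split only on pipes that are surrounded by whitespace.

```python
import re


def parse_real_alignment(line):
    """Parse one 'Real Alignment N: ...' line into a dict, or return None."""
    line = line.strip()
    if not line.startswith("Real Alignment"):
        return None
    _, _, rest = line.partition(":")
    # Split on whitespace-delimited pipes only, at most four times, so that
    # pipes inside the reference name are left untouched.
    fields = re.split(r"\s+\|\s+", rest.strip(), maxsplit=4)
    ref_start, ref_end = map(int, fields[0].split())
    contig_start, contig_end = map(int, fields[1].split())
    ref_aligned, contig_aligned = map(int, fields[2].split())
    reference, contig = fields[4].split()
    return {
        "ref_start": ref_start, "ref_end": ref_end,
        "contig_start": contig_start, "contig_end": contig_end,
        "ref_aligned": ref_aligned, "contig_aligned": contig_aligned,
        "identity": float(fields[3]),
        "reference": reference, "contig": contig,
    }


aln = parse_real_alignment(
    "Real Alignment 1: 19796 20513 | 29511 30228 | 718 718 | 100.0 | "
    "ENA|U00096|U00096.2_Escherichia_coli contig-710"
)
print(aln["reference"], aln["contig"], aln["identity"])
```

Note that 20513 - 19796 + 1 = 718, which matches the "aligned bases on the target" column, as the answer explains.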
Q5. Where does QUAST save information about SNPs?
There are two output files concerning SNPs. Both of them are saved in
The first one has extension ".all_snps" and it is raw Nucmer aligner output. Its format is:
[P1] [SUB] [SUB] [P2] [BUFF] [DIST] [R] [Q] [FRM] [TAGS]
15383 T G 3339560 1 15383 3 2 1 -1 Escherichia_coli contig_15
R and Q specify the number of other alignments which overlap this position (in Reference and Query (i.e. contig) respectively). FRM and TAGS are not documented in Nucmer help message, and the last two columns are reference name and contig name.
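A line of this file can be split into named fields as sketched below. The mapping follows the header shown above and the single example line. Treating the first substitution column as the reference base and the second as the contig base is an assumption, and the undocumented columns are kept as raw strings instead of being interpreted. All names are hypothetical.

```python
def parse_all_snps_line(line):
    """Split one '.all_snps' record into named fields."""
    tokens = line.split()
    return {
        "p1": int(tokens[0]),        # position in the reference
        "ref_base": tokens[1],       # assumed: base in the reference
        "contig_base": tokens[2],    # assumed: base in the contig
        "p2": int(tokens[3]),        # position in the contig
        "buff": int(tokens[4]),
        "dist": int(tokens[5]),
        "r": int(tokens[6]),         # overlapping alignments in the reference
        "q": int(tokens[7]),         # overlapping alignments in the contig
        "extra": tokens[8:-2],       # undocumented columns, left as-is
        "reference": tokens[-2],
        "contig": tokens[-1],
    }


snp = parse_all_snps_line(
    "15383 T G 3339560 1 15383 3 2 1 -1 Escherichia_coli contig_15"
)
print(snp["reference"], snp["p1"], snp["ref_base"], snp["contig_base"])
```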
The second file ("*.used_snps") is generated by QUAST.
We analyse all alignments and filter out "uninformative" ones (redundant, duplicated). The ".used_snps" file then includes only those SNPs that actually appear in the filtered alignments. Thus, the values of "# mismatches per 100 kbp" and "# indels per 100 kbp" reported by QUAST are based on USED SNPs, not ALL SNPs.
In addition, we use our own format of ".used_snps" file.
Escherichia_coli contig_15 728803 C . 3217983
Q6. What does "broken" version of an assembly refer to while assessing scaffolds' quality (--scaffolds option)?
The difference between the "broken" and the original assembly (scaffolds) is very simple. QUAST splits the input FASTA at continuous fragments of N's of length ≥ 10 and calls the result a "_broken" assembly. By doing this we try to reconstruct the "contigs" that were used to construct the scaffolds. After that, the user can compare results for the real scaffolds and the "reconstructed contigs" and find out whether the scaffolding step was useful or not.
If you have both contigs.fasta and scaffolds.fasta, it is better to pass both files to QUAST and not set the "--scaffolds" option. The comparison of real contigs vs real scaffolds is fairer and more informative than scaffolds vs scaffolds_broken.
To sum up, you should use the "--scaffolds" option if you don't have the original contigs file but want to compare your scaffolds with it.
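The splitting step described in this answer can be sketched in a few lines. This is an illustration of the idea, not QUAST's code: the function name is hypothetical, and treating lowercase "n" the same as "N" is an assumption.

```python
import re


def break_scaffold(sequence, min_n_run=10):
    """Split a scaffold at runs of at least min_n_run N's."""
    pieces = re.split(r"[Nn]{%d,}" % min_n_run, sequence)
    return [piece for piece in pieces if piece]  # drop empty edge pieces


scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 9 + "TT" + "N" * 25 + "A"
for piece in break_scaffold(scaffold):
    print(len(piece), piece)
```

The run of nine N's is shorter than the threshold, so it stays inside the second piece, while the runs of 10 and 25 N's are removed and become break points.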
Q7. Can I add new assemblies to existing QUAST report without need to realign already processed assemblies? Or can I at least rerun existing QUAST report with slightly modified options set?
Yes, sure! You just need to specify the existing QUAST output directory with the -o option, and the tool will reuse the Nucmer alignments that were already generated and run the alignment process only for the new assemblies.
Note that no QUAST option except --min-contig affects the Nucmer alignment process, so you can also rerun a previous QUAST command with modified options and QUAST will reuse the existing alignments.
Hint: if you run QUAST without specifying output dir with
-o option you can rerun it on the same directory