rnaQUAST 0.2.1 manual

1. About rnaQUAST
2. Installation & Requirements
3. Options
    3.1. Input data options
    3.2. Basic options
    3.3. Advanced options
4. Understanding rnaQUAST output
    4.1. Reports
    4.2. Detailed output
    4.3. Plots
5. Citation
6. Feedback and bug reports

1 About rnaQUAST

rnaQUAST is a tool for evaluating RNA-Seq assemblies using a reference genome and annotation. rnaQUAST version 0.2.1 was released under GPLv2 on June 10th, 2015 and can be downloaded from http://bioinf.spbau.ru/en/rnaquast.

For impatient people:

2 Installation & Requirements

To run rnaQUAST you need:

Note that, due to limitations of BLAT, pslSort is also required in order to work with reference genomes larger than 4 Gb.

Paths to BLASTN and BLAT should be added to the $PATH environment variable. To check that everything is installed correctly, we recommend running:

python rnaQUAST.py --test
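
Before running the test, it can help to confirm that the external tools are visible on $PATH. The following is a minimal sketch, not part of rnaQUAST itself; the executable names blat and blastn are assumptions about how the tools are installed on your system, and missing_tools is a hypothetical helper.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["blat", "blastn"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on $PATH.")
```

If you work with reference genomes larger than 4 Gb, the pslSort executable can be added to the list in the same way.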

3 Options

3.1 Input data options

To run rnaQUAST one needs to provide either FASTA files with transcripts (recommended), or align transcripts to the reference genome manually and provide the resulting PSL files.

-r <REFERENCE>, --reference <REFERENCE>
    Single file with reference genome containing all chromosomes/scaffolds in FASTA format (preferably with *.fasta, *.fa, *.fna, *.ffn or *.frn extension) OR
    *.txt file containing the one-per-line list of FASTA files with reference sequences.

-gtf <ANNOTATION>, --annotation <ANNOTATION>
    File with annotation in GTF/GFF format.

-c <TRANSCRIPTS ...>, --transcripts <TRANSCRIPTS ...>
     File(s) with transcripts in FASTA format, separated by spaces.

-psl <ALIGNMENTS ...>, --alignment <ALIGNMENTS ...>
     File(s) with transcript alignments in PSL format, separated by spaces.
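
The input options above can be combined into a command line as in the following sketch. The helper build_command and all file names are hypothetical. Treating --transcripts and --alignment as mutually exclusive is this sketch's reading of the description above, not a documented restriction.

```python
def build_command(reference, annotation, transcripts=None, alignments=None):
    """Assemble an rnaQUAST command line as a list of arguments."""
    # Assumption: exactly one of the two input kinds is supplied.
    if bool(transcripts) == bool(alignments):
        raise ValueError("provide either transcripts or alignments")
    cmd = ["python", "rnaQUAST.py",
           "--reference", reference,
           "--annotation", annotation]
    if transcripts:
        # FASTA files with transcripts (the recommended input).
        cmd += ["--transcripts"] + list(transcripts)
    else:
        # PSL files produced by aligning transcripts manually.
        cmd += ["--alignment"] + list(alignments)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_command("genome.fasta", "genes.gtf",
                        transcripts=["assembly1.fasta", "assembly2.fasta"])
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run from the directory that contains rnaQUAST.py.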

3.2 Basic options

-o <OUTPUT_DIR>, --output_dir <OUTPUT_DIR>
     Directory to store all results. Default is rnaQUAST_results/results_<datetime>.

--test
     Run rnaQUAST on the test data from the test_data folder. The output directory is rnaQUAST_test_output.

-d, --debug
     Report detailed information, typically used only when detecting problems.

-h, --help
     Show help message and exit.

3.3 Advanced options

-t <INT>, --threads <INT>
     Maximum number of threads. Default is the number of CPU cores (detected automatically).

-l <LABELS ...>, --labels <LABELS ...>
     Names of the assemblies to be used in the reports, separated by spaces.

-ss, --strand_specific
     Set if transcripts were assembled using strand-specific RNA-Seq data, in order to benefit from knowing whether a transcript originated from the + or - strand.

--min_alignment <MIN_ALIGNMENT>
     Minimum alignment size to be used. Default is 50.

     Do not draw plots (makes rnaQUAST run a bit faster).

-C, --cegma
     Run the CEGMA tool, which detects core eukaryotic genes in the assembly. CEGMA should be added to the $PATH variable.
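
The advanced options listed in this section can be appended to a command line as in the sketch below. The helper advanced_options is hypothetical and uses only the long option names given above; options left at their defaults are simply omitted.

```python
def advanced_options(threads=None, labels=None, strand_specific=False,
                     min_alignment=None, cegma=False):
    """Translate keyword settings into rnaQUAST command-line arguments."""
    args = []
    if threads is not None:
        args += ["--threads", str(threads)]
    if labels:
        # One label per assembly, in the order the assemblies are given.
        args += ["--labels"] + list(labels)
    if strand_specific:
        args.append("--strand_specific")
    if min_alignment is not None:
        args += ["--min_alignment", str(min_alignment)]
    if cegma:
        args.append("--cegma")
    return args


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    print(" ".join(advanced_options(threads=4, labels=["asm1", "asm2"],
                                    strand_specific=True)))
```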

4 Understanding rnaQUAST output

Below we describe metrics, statistics and plots generated by rnaQUAST. Metrics highlighted with bold italic are considered as the most important and are included in the short summary report.

4.1 Reports

The following text files with reports are contained in the comparison_output directory and contain results for all input assemblies. In addition, these reports are contained in the <assembly_label>_output directory of each separate assembly.

Gene database metrics.

Basic transcripts metrics calculated without reference genome and gene database.

Alignment metrics calculated with reference genome but without gene database. To calculate the following metrics rnaQUAST filters all short partial alignments (see options) and attempts to select the best hits for each transcript.
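
The filtering step can be illustrated with the sketch below. It is not rnaQUAST's actual implementation: measuring alignment size as qEnd - qStart in the standard 21-column PSL layout is an assumption, the helper filter_short_alignments is hypothetical, and best-hit selection is not shown.

```python
def filter_short_alignments(psl_lines, min_alignment=50):
    """Keep PSL records whose query span is at least `min_alignment` bases."""
    kept = []
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and malformed records: a PSL record has
        # 21 tab-separated columns and starts with the match count.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_start, q_end = int(fields[11]), int(fields[12])
        # Assumption: alignment size is the aligned span on the query.
        if q_end - q_start >= min_alignment:
            kept.append(line)
    return kept
```

The default of 50 mirrors the default value of --min_alignment.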

Number of assembled transcripts = Unaligned + Aligned = Unaligned + (Uniquely aligned + Multiply aligned + Misassembly candidates reported by BLAT).
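
The identity above can be used as a sanity check on the numbers in a report. The sketch below uses made-up counts, and check_transcript_counts is a hypothetical helper.

```python
def check_transcript_counts(total, unaligned, unique, multiple,
                            misassembly_candidates):
    """Return (aligned, consistent) for the counts taken from a report."""
    # Aligned = Uniquely aligned + Multiply aligned + Misassembly candidates
    aligned = unique + multiple + misassembly_candidates
    # Number of assembled transcripts = Unaligned + Aligned
    return aligned, total == unaligned + aligned


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    aligned, consistent = check_transcript_counts(
        total=1000, unaligned=50, unique=900, multiple=30,
        misassembly_candidates=20)
    print(aligned, consistent)
```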

Alignment metrics for non-misassembled transcripts

Alignment metrics for misassembled (chimeric) transcripts calculated with reference genome but without gene database.

Assembly completeness (sensitivity). For the following metrics (calculated with reference genome and gene database) rnaQUAST attempts to select the best-matching database isoforms (hereafter simply isoforms) for every transcript. Note that a single transcript can contribute to multiple isoforms, for example in the case of paralogous genes or genomic repeats. At the same time, an isoform can be covered by multiple transcripts when the assembly is fragmented or contains duplicated transcripts.

CEGMA metrics. The following metrics are calculated only when --cegma option is used (see options for details).

Assembly specificity. To compute the following metrics we use only transcripts that have at least one significant alignment and are not misassembled.

4.2 Detailed output

These files are contained in <assembly_label>_output directories for each separate assembly.

4.3 Plots

The following plots are likewise contained in both the comparison_output directory and the <assembly_label>_output directories. Please note that most of the plots represent cumulative distributions and that some plots use a logarithmic scale.




5 Citation

The paper has been submitted.

6 Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve rnaQUAST. If you have any trouble running rnaQUAST, please send us rnaquast.log from the output directory. Address for communications: rnaquast_support@ablab.spbau.ru.