SPAdes 2.4.0 Manual

1. About SPAdes
    1.1. Supported data types
    1.2. SPAdes pipeline
    1.3. SPAdes performance
2. Installation
    2.1. Downloading SPAdes Linux binaries
    2.2. Downloading SPAdes binaries for Mac
    2.3. Downloading and compiling SPAdes source code
    2.4. Verifying your installation
3. Running SPAdes
    3.1. SPAdes input
    3.2. SPAdes command line options
    3.3. SPAdes output
    3.4. Assembly evaluation
4. Citation
5. Feedback and bug reports

1. About SPAdes

SPAdes – St. Petersburg genome assembler – is intended for both standard isolates and single-cell MDA bacteria assemblies. This manual will help you to install and run SPAdes. SPAdes version 2.4.0 was released under GPLv2 on 26 February 2013 and can be downloaded from http://bioinf.spbau.ru/spades.

1.1 Supported data types

The current version of SPAdes works only with Illumina reads. Support for other technologies (e.g. Roche 454, IonTorrent, PacBio) is in development and will probably be included in one of the next releases.

SPAdes supports paired-end reads as well as unpaired reads. At present it accepts only one paired-end library as input. Support for mate-pairs and multiple libraries is in development and is likely to arrive in the next major release.

Also note that SPAdes was initially designed for single-cell and standard bacterial data sets and is not intended for larger genomes (e.g. mammalian-size genomes) or metagenomic projects. You can use it for such purposes at your own risk.

1.2 SPAdes pipeline

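SPAdes comes in three separate modules:

    BayesHammer – read error correction tool
    SPAdes – the assembler itself
    MismatchCorrector – a post-processing tool, which uses the BWA tool (run with --careful)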

We recommend running SPAdes with BayesHammer to obtain high-quality assemblies. However, if you use your own read correction tool, you can turn BayesHammer off. You can also run only the read error correction stage if you wish to use another assembler. See the SPAdes command line options section.

1.3 SPAdes' performance

In this section we give approximate data about SPAdes' performance on two data sets, an E. coli isolate and an E. coli single-cell data set.

We ran SPAdes with default parameters using 16 threads on a server with Intel Xeon 2.27GHz processors. BayesHammer runs in approximately 75 minutes and takes up to 15Gb of RAM to perform read error correction on each data set. Assembly takes 35 minutes and 4Gb of RAM for the E. coli isolate data set, and 55 minutes and 6Gb of RAM for the E. coli single-cell data set. MismatchCorrector runs for about an hour on the isolate data set and about 2 hours on the single-cell data set, and requires 8Gb of RAM in both cases. All modules also require additional disk space for temporary files. See the table below for more precise values.

Data set               Stage               Time (min)   Peak RAM usage (Gb)   Additional disk space (Gb)
E. coli isolate        BayesHammer         75           14                    13
E. coli isolate        SPAdes              35           4                     8
E. coli isolate        MismatchCorrector   60           8                     24
E. coli single-cell    BayesHammer         75           15                    13
E. coli single-cell    SPAdes              55           6                     8
E. coli single-cell    MismatchCorrector   120          8                     24

Notes:

2. Installation

SPAdes requires a 64-bit Linux system or Mac OS with Python 2 (version 2.4 or higher; Python 3 is not supported) pre-installed. To obtain SPAdes, you can either download the binaries or download the source code and compile it yourself.
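The requirements above can be checked before downloading anything. The following is only a minimal sketch and not part of SPAdes: the function names are hypothetical, and the 64-bit test looks at the Python interpreter's pointer size, which is assumed here to reflect the system.

```python
import platform
import struct
import sys


def meets_requirements(system, pointer_bits, python_version):
    """Return True if the values match the requirements stated in this section."""
    supported_os = system in ("Linux", "Darwin")  # Mac OS reports itself as "Darwin"
    is_64bit = pointer_bits == 64
    major, minor = python_version[0], python_version[1]
    python_ok = major == 2 and minor >= 4  # 2.4 or higher, but not 3
    return supported_os and is_64bit and python_ok


def current_platform():
    """Collect the three values for the interpreter that runs this script."""
    # Assumption: the interpreter's pointer size stands in for the system's bitness.
    pointer_bits = struct.calcsize("P") * 8
    return platform.system(), pointer_bits, tuple(sys.version_info[:2])


if __name__ == "__main__":
    print(meets_requirements(*current_platform()))
```

Run the script with the same python executable that will be used to launch spades.py, since the check describes the interpreter it runs under.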

2.1 Downloading SPAdes Linux binaries

To download SPAdes Linux binaries and extract them, run:


    wget http://spades.bioinf.spbau.ru/release2.4.0/SPAdes-2.4.0-Linux.tar.gz
    tar -xzf SPAdes-2.4.0-Linux.tar.gz
    cd SPAdes-2.4.0-Linux/bin/

In this case you do not need to run any installation scripts – SPAdes is ready to use. The following files will be placed in bin directory:

2.2 Downloading SPAdes binaries for Mac

To obtain SPAdes binaries for Mac, run:


    curl http://spades.bioinf.spbau.ru/release2.4.0/SPAdes-2.4.0-Darwin.tar.gz -o SPAdes-2.4.0-Darwin.tar.gz
    tar -zxvf SPAdes-2.4.0-Darwin.tar.gz
    cd SPAdes-2.4.0-Darwin/bin/

Just as in Linux, SPAdes is ready to use and no further installation steps are required. You will get the same files in bin directory:

2.3 Downloading and compiling SPAdes source code

If you wish to compile SPAdes by yourself you will need the following libraries to be pre-installed:

If you meet these requirements, you can download SPAdes source code:


    wget http://spades.bioinf.spbau.ru/release2.4.0/SPAdes-2.4.0.tar.gz
    tar -xzf SPAdes-2.4.0.tar.gz
    cd SPAdes-2.4.0

and build it with the following script:


    ./spades_compile.sh

In this case SPAdes will be built into the ./bin directory. If you wish to install SPAdes into another directory, specify the full path of the destination folder by running:


    PREFIX=<destination_dir> ./spades_compile.sh

for example:


    PREFIX=/usr/local ./spades_compile.sh

which will install SPAdes into /usr/local/bin.

After installation you will get the same files in ./bin (or <destination_dir>/bin if you specified PREFIX) directory:

2.4 Verifying your installation

For testing purposes, SPAdes comes with a “toy data set” (reads that align to the first 1000 bp of E. coli). To try SPAdes on this data set, run:


    ./spades.py --test

If the installation is successful, you will find the following information at the end of the log:


     * Corrected reads are in spades_test/corrected/
     * Assembled contigs are in spades_test/contigs.fasta
     * Assembled scaffolds are in spades_test/scaffolds.fasta

    Thank you for using SPAdes!

    ======= SPAdes pipeline finished. Log can be found here: spades_test/spades.log
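If you run the test from a script, the check for this final message can be automated. The sketch below is an illustration only: pipeline_finished is a hypothetical helper, and it assumes you have the log text as a string (for example, the captured terminal output) and that the marker line appears exactly as quoted above.

```python
def pipeline_finished(log_text):
    """Return True if the last lines of the log contain the success marker."""
    marker = "======= SPAdes pipeline finished."
    # Only the last few non-empty lines are inspected, since the message
    # is expected at the end of the log.
    tail = [line.strip() for line in log_text.splitlines() if line.strip()][-5:]
    return any(line.startswith(marker) for line in tail)
```

For example, pipeline_finished(open("captured_output.txt").read()) would return True for the output quoted above, where captured_output.txt is a placeholder file name.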

3. Running SPAdes

3.1 SPAdes input

SPAdes takes as input forward-reverse paired-end reads as well as single (unpaired) reads in FASTA or FASTQ format. However, in order to run read error correction, reads should be in FASTQ format. Currently SPAdes accepts only one paired-end library, which can be stored in several files or several pairs of files. The number of unpaired libraries is unlimited.

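Paired-end reads can be organized in two different ways: in file pairs (one file with forward reads and one file with reverse reads), or in merged files in which forward and reverse reads are interlaced.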

For example, Illumina produces paired-end reads in two files: s_1_1_sequence.txt and s_1_2_sequence.txt. If you choose to store reads in file pairs, make sure that every read in s_1_1_sequence.txt has its paired read at the same position (on the same line numbers) in s_1_2_sequence.txt. If you choose to use merged files, every read from s_1_1_sequence.txt should be immediately followed by the corresponding paired read from s_1_2_sequence.txt.
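The merged layout can be produced from a file pair with a few lines of Python. This is a minimal sketch rather than a SPAdes tool: read_fastq and interleave are hypothetical helpers, records are assumed to be plain four-line FASTQ records, and mates are paired purely by position, without comparing read names.

```python
def read_fastq(lines):
    """Yield FASTQ records as 4-tuples of lines (assumes 4 lines per record)."""
    record = []
    for line in lines:
        if not record and not line.strip():
            continue  # tolerate blank lines between records
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []
    if record:
        raise ValueError("truncated FASTQ record")


def interleave(forward_lines, reverse_lines):
    """Yield the lines of a merged file: each forward read, then its mate."""
    reverse_records = read_fastq(reverse_lines)
    for forward in read_fastq(forward_lines):
        reverse = next(reverse_records, None)
        if reverse is None:
            raise ValueError("reverse file has fewer reads than forward file")
        for line in forward + reverse:
            yield line
    if next(reverse_records, None) is not None:
        raise ValueError("reverse file has more reads than forward file")
```

Both arguments can be open file objects, so the result can be written out line by line and then passed to SPAdes as a merged file. A mismatch in the number of reads raises an error instead of silently producing a broken pairing.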

Note that SPAdes also accepts gzip-compressed files.

3.2 SPAdes command line options

To run SPAdes from the command line, type


    ./spades.py [options] -o <output_dir>

Basic options

-o <output_dir>
    Specify the output directory. Required option.

--sc
    This flag is required for MDA (single-cell) data.

--12 <file_name>
    File with interlaced forward and reverse paired-end reads.

-1 <file_name>
    File with forward reads.

-2 <file_name>
    File with reverse reads.

-s <file_name>
    File with unpaired reads.

--test
    Runs SPAdes on the “toy data set”; see section 2.4.

-h (or --help)
    Prints help.

Pipeline options

--only-error-correction
    Performs read error correction only.

--only-assembler
    Runs assembly module only.

--careful
    Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector – a post-processing tool, which uses the BWA tool (comes with SPAdes).

--disable-gzip-output
    Forces the read error correction module not to compress the corrected reads. If this option is not set, corrected reads will be in *.fastq.gz format.

--rectangles
    Uses the rectangle graph algorithm for the repeat resolution stage instead of the usual SPAdes repeat resolution module (experimental).

Advanced options

-t <int> (or --threads <int>)
    Number of threads. The default value is 16.

-m <int> (or --memory <int>)
    Set memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. Actual amount of consumed RAM will be below this limit.

--tmp-dir <dir_name>
    Set directory for temporary files from read error correction. The default value is <output_dir>/corrected/tmp.

-k <int,int,...>
    Comma-separated list of k-mer sizes to be used (all values must be odd, less than 128 and listed in ascending order). The default value is 21,33,55.

-i <int> (or --iterations <int>)
    Number of iterations for read error correction. The default value is 1.

--phred-offset <33 or 64>
    PHRED quality offset for the input reads, can be either 33 or 64. It will be auto-detected if it is not specified.

--debug
    Runs SPAdes in debug mode, keeping intermediate results.
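The constraints on the -k list (odd values, less than 128, ascending order) are easy to get wrong when the list is generated by a script. The following sketch checks a value before it is passed to spades.py. It is not SPAdes code: parse_kmer_list is a hypothetical helper, and treating repeated values as an error is this sketch's reading of "ascending order".

```python
def parse_kmer_list(text):
    """Parse a comma-separated k-mer list and check the documented constraints."""
    values = [int(part) for part in text.split(",")]
    for k in values:
        if k <= 0:
            raise ValueError("k-mer size %d is not positive" % k)
        if k % 2 == 0:
            raise ValueError("k-mer size %d is not odd" % k)
        if k >= 128:
            raise ValueError("k-mer size %d is not less than 128" % k)
    # Assumption: "ascending" is read as strictly increasing.
    for previous, current in zip(values, values[1:]):
        if current <= previous:
            raise ValueError("k-mer sizes are not in ascending order")
    return values
```

With the default value, parse_kmer_list("21,33,55") returns the three sizes as integers, while a list such as "21,32" or "55,33" raises a ValueError.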

Examples

To run SPAdes on the “toy data set” you can also type:


    ./spades.py -1 ../share/spades/test_dataset/ecoli_1K_1.fq.gz -2 ../share/spades/test_dataset/ecoli_1K_2.fq.gz -o spades_test

If you have your library separated into several pairs of files, for example:


    pe_forward_1.fastq
    pe_reverse_1.fastq
    pe_forward_2.fastq
    pe_reverse_2.fastq

make sure that corresponding files go in the same order:


    ./spades.py -1 pe_forward_1.fastq -2 pe_reverse_1.fastq -1 pe_forward_2.fastq -2 pe_reverse_2.fastq -o spades_output

Files with interlacing paired-end reads or files with unpaired reads can be specified in any order with one file per option, for example:


    ./spades.py --12 pe_1.fastq --12 pe_2.fastq -s unpaired_1.fastq -s unpaired_2.fastq -o spades_output    

Options -1, -2, --12 and -s can be mixed together if needed.
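When SPAdes is launched from a script, the ordering rule for file pairs is easier to keep if the command line is assembled programmatically. The sketch below only uses options described in this section. build_spades_command and its parameter names are hypothetical, and the placement of --sc and --careful before the input files is a choice made here, not a requirement stated in this manual.

```python
def build_spades_command(output_dir, pairs=(), interlaced=(), unpaired=(),
                         single_cell=False, careful=False, spades="./spades.py"):
    """Return the spades.py argument list for the given input files."""
    if not (pairs or interlaced or unpaired):
        raise ValueError("at least one input file is required")
    command = [spades]
    if single_cell:
        command.append("--sc")
    if careful:
        command.append("--careful")
    # Each forward file is immediately followed by its reverse file,
    # so corresponding files always go in the same order.
    for forward, reverse in pairs:
        command += ["-1", forward, "-2", reverse]
    for path in interlaced:
        command += ["--12", path]
    for path in unpaired:
        command += ["-s", path]
    command += ["-o", output_dir]
    return command
```

The returned list can be handed to subprocess.call, which avoids shell quoting problems with unusual file names.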

3.3 SPAdes output

SPAdes stores all output files in <output_dir>, which is set by the user.

Note: scaffolds are not produced if no paired library is provided.

Full list of <output_dir> content is presented below:

    contigs.fasta      resulting contigs
    scaffolds.fasta    resulting scaffolds (will not be produced if input reads are unpaired)

    corrected/         files from read error correction
        configs/       configuration files for read error correction
        dataset.info   internal configuration file
        Output files with corrected reads

    params.txt         information about SPAdes parameters in this run
    spades.log         SPAdes log
    dataset.info       internal configuration file
    K<##>/             directory containing files from the run with K=<##> (K21, K33 and K55 are created by default)

SPAdes will overwrite these files and directories if they exist in the specified <output_dir>.
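Because existing results are overwritten, it can be worth checking a directory before reusing it as <output_dir>. The sketch below lists entries whose names match the output files described in this section. outputs_at_risk is a hypothetical helper, and the name list is taken from this manual rather than from SPAdes itself.

```python
import os
import re

# Top-level names from the list of <output_dir> contents in this section.
FIXED_NAMES = ("contigs.fasta", "scaffolds.fasta", "corrected",
               "params.txt", "spades.log", "dataset.info")
K_DIR = re.compile(r"^K\d+$")  # per-k directories such as K21, K33, K55


def outputs_at_risk(output_dir):
    """Return the sorted names in output_dir that a new run would overwrite."""
    if not os.path.isdir(output_dir):
        return []
    found = []
    for name in sorted(os.listdir(output_dir)):
        is_k_dir = K_DIR.match(name) and os.path.isdir(os.path.join(output_dir, name))
        if name in FIXED_NAMES or is_k_dir:
            found.append(name)
    return found
```

An empty result means that the directory does not exist yet or contains none of the listed names. Otherwise, copy the reported entries elsewhere or choose a different <output_dir>.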

3.4 Assembly evaluation

We recommend using QUAST for assembly evaluation.

4. Citation

If you use SPAdes in your research, please include Bankevich, Nurk et al., 2012 in your reference list.

In addition, we would like to list your publications that use our software on our website. Please email the reference, the name of your lab, department and institution to spades.support@bioinf.spbau.ru.

5. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve SPAdes.

If you have trouble running SPAdes, please provide us with the files params.txt and spades.log from the directory <output_dir>.

Address for communications: spades.support@bioinf.spbau.ru.