SPAdes 2.1.0 Manual

SPAdes stands for St. Petersburg genome assembler. It is intended for both standard (multicell) and single-cell MDA bacterial assemblies. This manual will help you install and run SPAdes. You can find the latest SPAdes release at http://bioinf.spbau.ru/spades.

1 Installation

SPAdes requires a 64-bit Linux system and Python 2.4 or higher. We provide two test datasets: a single-cell E. coli dataset and a multicell E. coli dataset. Error correction for these datasets requires 85 Gb of RAM, and assembly after error correction requires 30 Gb of RAM.

1.1 SPAdes tar.gz file

To download SPAdes tar.gz and extract it:


    wget http://spades.bioinf.spbau.ru/release2.1.0/spades_2.1.0.tar.gz
    tar -xzf spades_2.1.0.tar.gz
    cd spades-2.1.0

SPAdes depends on the following libraries for compiling its code:

If you cannot meet these requirements, please use the precompiled binaries as described in the next section.

1.2 Downloading binaries

Precompiled SPAdes binaries for the default parameters may be downloaded with the following scripts:


    ./spades_download_binary.py 21 33 55
    ./spades_download_bayeshammer.py

This will download the binaries to the build/release2.1.0 directory. SPAdes uses binaries optimized for each value of k. You may download binaries for the required values of k with the script


    ./spades_download_binary.py k1 k2

passing the space-separated values of k as arguments.

1.3 Testing your installation

For testing purposes, SPAdes comes with a toy dataset (first 1000 bp of E. coli). If you run spades.py with the parameter --test


    ./spades.py --test

it will process this dataset and, if the installation is successful, you will see something like this at the end of the log:


     * Corrected reads are in spades_output/ECOLI_1K/corrected/
     * Assembled contigs are spades_output/ECOLI_1K/spades_04.18_17.59.30/ECOLI_1K.fasta

    Thank you for using SPAdes!

    ======= SPAdes pipeline finished

2 Running SPAdes

2.1 Input data

SPAdes accepts single reads as well as forward-reverse paired-end reads in FASTA and FASTQ formats; however, error correction can be run only on reads in FASTQ format. All files may be compressed with gzip. At present, SPAdes accepts only one paired-end library as input.

Paired-end reads may be supplied either in one file with interlaced left and right reads or in two separate files, one with left reads and one with right reads.
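When paired-end reads come as two files but a single interlaced file is more convenient, the conversion can be done with a few lines of Python. The sketch below is a hypothetical helper, not part of SPAdes. It assumes four-line (unwrapped) FASTQ records, that both files list the pairs in the same order, and that an interlaced file holds each left read immediately followed by its right mate. It is written for a current Python 3 interpreter.

```python
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        # Normalize line endings so records never run together on output.
        yield [line.rstrip("\n") + "\n" for line in record]


def interlace(left_handle, right_handle, out_handle):
    """Write left and right records alternately; return the number of pairs."""
    right_records = read_fastq_records(right_handle)
    pairs = 0
    for left in read_fastq_records(left_handle):
        right = next(right_records, None)
        if right is None:
            raise ValueError("right file has fewer reads than left file")
        out_handle.writelines(left)
        out_handle.writelines(right)
        pairs += 1
    if next(right_records, None) is not None:
        raise ValueError("right file has more reads than left file")
    return pairs
```

With real files, the three handles would come from open() calls on, say, left.fastq, right.fastq and interlaced.fastq (placeholder names); gzip-compressed input would need gzip.open in text mode instead.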

2.2 SPAdes pipeline

SPAdes has two stages: error correction by BayesHammer and genome assembly. By default, all results are stored in the directory spades_output. For each dataset, SPAdes creates a separate directory <project_name> in spades_output. Corrected reads are stored in the subdirectory corrected in *.fastq.gz files. The resulting contigs are stored in the subdirectory contigs in the file <project_name>.fasta. If the --generate-sam-file option was set, a symlink to the <project_name>.sam file is also created in the same directory.
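For scripts that pick up SPAdes results, the layout described above can be turned into paths. This is a minimal sketch under the assumption that the directory structure is exactly as this section states; the function name and dictionary keys are made up for illustration, and the final lines of the SPAdes log remain the authoritative source for where the files ended up.

```python
import os


def expected_outputs(project_name, output_dir="spades_output"):
    """Return the result locations described in section 2.2 for one project."""
    project_dir = os.path.join(output_dir, project_name)
    return {
        # Directory holding the corrected reads.
        "corrected_reads_dir": os.path.join(project_dir, "corrected"),
        # FASTA file with the resulting contigs.
        "contigs": os.path.join(project_dir, "contigs", project_name + ".fasta"),
    }
```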

If SPAdes finds corrected reads from a previous run, it asks the user whether it should run error correction again. By default, SPAdes does not re-run error correction and starts the assembly after 10 seconds unless the user says otherwise.

There are two ways to run SPAdes on your dataset: you can specify the parameters on the command line, or you can provide a configuration file as the only ./spades.py parameter.

2.3 SPAdes command line options

SPAdes is run from the command line as follows:


    spades.py [options] -n <project_name>

Here is the description of options:

-n <project_name>

    A required option that sets the name of the project.

-o <output_dir>

    Specify the output directory. Default: spades_output.

--sc

    This flag is required for MDA (single-cell) data.

--12 <filename>

    File with interlaced left and right paired-end reads.

-1 <filename>

    File with left paired-end reads.

-2 <filename>

    File with right paired-end reads.

-s <filename>

    File with unpaired reads.

--generate-sam-file

    Generate a SAM file. See more details in section 3.

-t <int> (or --threads <int>)

    Number of threads. The default value is 16.

-m <int> (or --memory <int>)

    Sets the memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. For technical reasons, the physical memory actually consumed will be smaller than this limit.

--tmp-dir <dirname>

    Set the directory for error correction's temporary files. The default value is <output_dir>/<project_name>/corrected/tmp.

-k <int,int,...>

    Comma-separated list of increasing k-mer sizes (all values must be odd). Default is 21,33,55.

-i <int> (or --iterations <int>)

    Number of iterations for error correction. The default value is 2.

--phred-offset <33 or 64>

    PHRED quality offset for the input reads, can be either 33 or 64. It will be auto-detected if it is not specified.

--only-error-correction

    Runs only error correction, without the assembler.

--only-assembler

    Runs only the assembler, without error correction.

--disable-gap-closer

    Forces SPAdes not to use the gap closer.

--disable-gzip-output

    Forces error correction not to compress the corrected reads.

--test

    Runs SPAdes on the toy dataset; see section 1.3.

-h (or --help)

    Print help.
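When SPAdes is launched from another script, it can help to assemble the argument list in one place and check the -k rule (odd, increasing values) before starting a long run. The sketch below uses only options from the list above; the helper names and the file names in the usage note are hypothetical. Two checks reflect our reading of the manual rather than documented behavior: that -1 and -2 are given together, and that --12 is not combined with -1/-2 because only one paired-end library is accepted.

```python
def validate_k_values(k_values):
    """Check the rules stated for -k: every value odd, values increasing."""
    for k in k_values:
        if k % 2 == 0:
            raise ValueError("k-mer size %d is even; all values must be odd" % k)
    for previous, current in zip(k_values, k_values[1:]):
        if current <= previous:
            raise ValueError("k-mer sizes must be increasing")


def build_spades_command(project_name, left=None, right=None, interlaced=None,
                         single=None, single_cell=False, k_values=None,
                         threads=None, memory_gb=None, output_dir=None):
    """Return an argument list for spades.py built from section 2.3 options."""
    if (left is None) != (right is None):
        raise ValueError("-1 and -2 must be given together (assumption)")
    if interlaced is not None and left is not None:
        raise ValueError("only one paired-end library is accepted")
    if left is None and interlaced is None and single is None:
        raise ValueError("no input reads given")
    command = ["./spades.py", "-n", project_name]
    if output_dir is not None:
        command += ["-o", output_dir]
    if single_cell:
        command.append("--sc")
    if interlaced is not None:
        command += ["--12", interlaced]
    if left is not None:
        command += ["-1", left, "-2", right]
    if single is not None:
        command += ["-s", single]
    if k_values is not None:
        validate_k_values(k_values)
        command += ["-k", ",".join(str(k) for k in k_values)]
    if threads is not None:
        command += ["-t", str(threads)]
    if memory_gb is not None:
        command += ["-m", str(memory_gb)]
    return command
```

For example, build_spades_command("ecoli_sc", left="left.fastq", right="right.fastq", single_cell=True, threads=8, memory_gb=64) yields a list that can be passed to subprocess.call from the directory containing spades.py.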

2.4 Preparing configuration files

You can find the configuration file spades_config.info for the toy dataset from section 1.3 in the directory where you extracted the archive (see section 1.1). You can use this file as a template for your own datasets. On each line of this file, all text after the first semicolon is a comment.

The configuration file starts with the common parameters for the SPAdes run. We recommend setting the project_name parameter that specifies the directory and names for output files. After that there is a dataset section that contains the information about input reads. This section is required and you need to specify either paired_reads or single_reads or both.

The next two sections are error_correction and assembly. If you want to skip one of these stages, you can remove, rename, or comment out the corresponding section.
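The comment rule stated above (everything after the first semicolon on a line is a comment) is easy to apply when inspecting a configuration file from a script. The following sketch implements only that rule; it does not parse the sections or parameters, and the function names are made up for illustration.

```python
def strip_config_comment(line):
    """Drop the first semicolon and everything after it, plus trailing spaces."""
    return line.split(";", 1)[0].rstrip()


def meaningful_lines(text):
    """Return the lines of a configuration text that are not blank or comments."""
    stripped = (strip_config_comment(line) for line in text.splitlines())
    return [line for line in stripped if line.strip()]
```

To comment out a line or a whole section by hand, prefix each of its lines with a semicolon.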

2.5 Supporting files and directories

SPAdes stores compiled binaries in the directory build and reuses them. To save disk space, or to force SPAdes to recompile the binaries, just delete this directory. Most users will not need to do this.

For each SPAdes run, we create the directory spades_<date_time> in spades_output/<project_name>. It contains internal configs, logs and intermediate results for different values of k. We also have symbolic links latest and latest_success to the directories for the latest and the latest successful runs.
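A script that needs the files from the most recent successful run can follow the latest_success link rather than guessing the date-stamped directory name. This is a small sketch assuming the layout described above; the function name is hypothetical.

```python
import os


def latest_successful_run(project_name, output_dir="spades_output"):
    """Return the directory behind the latest_success symlink, or None."""
    link = os.path.join(output_dir, project_name, "latest_success")
    if not os.path.islink(link):
        return None  # no successful run recorded for this project
    # realpath resolves the link, including relative link targets.
    return os.path.realpath(link)
```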

3 Postprocessing

After running SPAdes you can evaluate different quality metrics using our pipeline. Please see quality.html for details.

SPAdes can generate a SAM file for further processing. Because SAM files are very large, SPAdes generates one only if you set the --generate-sam-file option on the command line. The SAM file contains information about the alignment of the original reads to the resulting contigs. We recommend using Tablet [3] to visualize these alignments.
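For a quick look at the SAM file without a viewer, the number of aligned reads per contig can be tallied directly. The sketch below relies only on the standard SAM text layout (tab-separated columns, header lines starting with "@", FLAG in column 2, reference name in column 3, flag bit 0x4 marking an unmapped read); it is not specific to SPAdes, and the function name is made up for illustration.

```python
from collections import Counter


def reads_per_contig(sam_lines):
    """Count mapped alignment records per reference (contig) name."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        contig = fields[2]
        if flag & 0x4 or contig == "*":
            continue  # unmapped read
        counts[contig] += 1
    return counts
```

Since SAM files are large, pass the open file object itself so that lines are read one at a time rather than loaded into memory.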

We also recommend running SEQuel [4] after SPAdes to further reduce the number of small errors (single-nucleotide substitutions and small indels). SEQuel also requires the SAM file as input.

The current version of SPAdes does not have a scaffolding stage. One can use a separate scaffolder such as Opera [2].

4 Feedback and bug reports

We will be thankful if you help us make SPAdes better by sending your comments, bug reports, and suggestions to spades.support@bioinf.spbau.ru.

If you have trouble running SPAdes, we kindly ask you to attach the file params.txt and the logs corrected/correction.log and contigs/assembly.log. These files are located in the directory <output_dir>/<project_name>.
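Before writing a report, it may help to check which of these files exist for a given project. The sketch below assumes the file locations named above; the function name is hypothetical, and a missing file may simply mean the corresponding stage did not run.

```python
import os

# Files requested for bug reports, relative to <output_dir>/<project_name>.
SUPPORT_FILES = [
    "params.txt",
    os.path.join("corrected", "correction.log"),
    os.path.join("contigs", "assembly.log"),
]


def support_files(project_name, output_dir="spades_output"):
    """Return (found, missing) lists of paths to attach to a bug report."""
    project_dir = os.path.join(output_dir, project_name)
    found, missing = [], []
    for relative in SUPPORT_FILES:
        path = os.path.join(project_dir, relative)
        (found if os.path.isfile(path) else missing).append(path)
    return found, missing
```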

References

[1] A. Bankevich, S. Nurk, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. May 2012, 19(5): 455-477. doi:10.1089/cmb.2012.0021.

[2] Song Gao, Wing-Kin Sung, and Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology. November 2011, 18(11): 1681-1691. doi:10.1089/cmb.2011.0170.

[3] I. Milne, M. Bayer, L. Cardle, P. Shaw, G. Stephen, F.Wright, and D. Marshall. Tablet — next generation sequence assembly visualization. Bioinformatics (2010) 26 (3): 401-402. doi: 10.1093/bioinformatics/btp666

[4] R. Ronen, C. Boucher, H. Chitsaz, and P. Pevzner. SEQuel: Improving the accuracy of genome assemblies. Bioinformatics, 2012. To appear.