SPAdes 2.2.0 Manual

SPAdes stands for St. Petersburg genome assembler. It is intended for both standard (multicell) and single-cell MDA bacterial assemblies. This manual will help you install and run SPAdes. You can find the latest SPAdes release at http://bioinf.spbau.ru/spades. The latest version of this manual is available online. SPAdes version 2.2.0 was released under GPLv2 on 2 August 2012.

1 Installation

SPAdes requires a 64-bit Linux system and Python 2.4 or higher. We provide two test datasets: a single-cell E. coli dataset and a multicell E. coli dataset. SPAdes requires 35 Gb of RAM to process these datasets.

1.1 SPAdes tar.gz file

To download the SPAdes tar.gz file and extract it:


    wget http://spades.bioinf.spbau.ru/release2.2.0/spades-2.2.0.tar.gz
    tar -xzf spades-2.2.0.tar.gz
    cd spades-2.2.0

1.2 Getting SPAdes binaries

There are two ways to obtain SPAdes binaries: download static builds from our server or compile SPAdes on your server.

We recommend downloading the binaries with the following script:


    ./spades_download_binary.py

You can also compile SPAdes yourself, but compiling its code depends on the following libraries:

If you meet these requirements, you can build SPAdes with the following script:


    ./spades_compile.sh

In both cases you should get a bin directory containing the files hammer (error correction module) and spades (assembly module).
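
As a quick sanity check after either route, the following Python sketch reports which of the two expected files are missing from bin. The function name missing_binaries is hypothetical and not part of SPAdes; only the file names hammer and spades come from this manual.

```python
import os

# File names taken from this manual: hammer (error correction), spades (assembly).
EXPECTED_BINARIES = ("hammer", "spades")


def missing_binaries(bin_dir="bin"):
    """Return the expected binaries that are absent from bin_dir or not executable."""
    missing = []
    for name in EXPECTED_BINARIES:
        path = os.path.join(bin_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

Calling missing_binaries() from the extracted spades-2.2.0 directory returns an empty list when both files are in place.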

1.3 Testing your installation

For testing purposes, SPAdes comes with a toy dataset (first 1000 bp of E. coli). If you run spades.py with the parameter --test


    ./spades.py --test

it will process this dataset and, if the installation is successful, you will see something like this at the end of the log:


     * Corrected reads are in spades_output/ECOLI_1K/corrected/
     * Assembled contigs are spades_output/ECOLI_1K/spades_04.18_17.59.30/ECOLI_1K.fasta

    Thank you for using SPAdes!

    ======= SPAdes pipeline finished
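
If you want to check a saved log automatically, something like the following sketch may help. It assumes the log ends with lines worded as in the example above; the exact wording may differ between versions, and the function name summarize_log is hypothetical.

```python
import re

# Marker line copied from the example log in this section.
FINISHED_MARKER = "SPAdes pipeline finished"


def summarize_log(log_text):
    """Return (finished, contigs_path) extracted from the text of a SPAdes log."""
    finished = FINISHED_MARKER in log_text
    # Capture the path that follows "Assembled contigs are", if that line is present.
    match = re.search(r"Assembled contigs are\s+(\S+)", log_text)
    return finished, (match.group(1) if match else None)
```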

2 Running SPAdes

2.1 Input data

SPAdes accepts single reads as well as forward-reverse paired-end reads in FASTA and FASTQ formats; however, to run error correction, reads must be in FASTQ format. All files may be compressed with gzip. At present, SPAdes accepts only one paired-end library as input.
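
FASTQ quality strings are encoded with an offset of either 33 or 64 (see the --phred-offset option in section 2.3). This manual does not describe how SPAdes auto-detects the offset; the sketch below shows one common heuristic based on the range of quality characters, purely as an illustration. The function name guess_phred_offset and the thresholds 59 and 74 are assumptions of this example.

```python
def guess_phred_offset(quality_strings):
    """Guess 33 or 64 from FASTQ quality strings; return None if ambiguous."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        return None
    # Characters below ';' (code 59) are not produced by offset-64 encodings.
    if min(codes) < 59:
        return 33
    # Characters above 'J' (code 74) are rare in offset-33 Illumina data (assumption).
    if max(codes) > 74:
        return 64
    # Every character fits both encodings, so the sample does not decide.
    return None
```

When the function returns None, passing --phred-offset explicitly avoids relying on any guess.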

SPAdes supports paired-end reads organized either in two separate files or combined in one file.
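
If your pipeline needs a single merged file, a minimal sketch for building one from two files follows. It assumes that the merged layout is interleaved (each left read immediately followed by its mate) and that every FASTQ record spans exactly four lines with no trailing blank lines; check these assumptions against your data. The function names are hypothetical, and the code targets Python 3.

```python
import itertools


def read_records(handle):
    """Yield FASTQ records as lists of four lines (four-line records assumed)."""
    while True:
        record = [line.rstrip("\n") for line in itertools.islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave_fastq(left, right, out):
    """Write left and right records alternately to out; return the number of pairs."""
    pairs = 0
    for l_rec, r_rec in itertools.zip_longest(read_records(left), read_records(right)):
        if l_rec is None or r_rec is None:
            raise ValueError("left and right inputs contain different numbers of reads")
        # Left read first, then its mate.
        for record in (l_rec, r_rec):
            out.write("\n".join(record) + "\n")
        pairs += 1
    return pairs
```

The arguments are open text-mode file objects; for gzip-compressed input, gzip.open(path, "rt") from the standard library returns a suitable object.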

2.2 SPAdes pipeline

SPAdes automatically creates a spades_output directory where it stores all of the resulting files. For each dataset SPAdes creates a separate directory <project_name> in spades_output.

Before starting assembly, SPAdes runs BayesHammer to correct errors in the reads. The corrected reads are then stored in the directory spades_output/<project_name>/corrected in *.fastq.gz files.

After assembly completes, the resulting contigs are stored in the directory spades_output/<project_name>/contigs in the file <project_name>.fasta. If the --generate-sam-file option was used, a symlink to <project_name>.sam is also created in the same directory.
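
The locations described in this section can be assembled programmatically. The helper below is a hypothetical convenience, not part of SPAdes, and assumes the layout described here; the .sam symlink exists only if --generate-sam-file was used.

```python
import os


def expected_outputs(project_name, output_dir="spades_output"):
    """Return the paths where this manual says the results are stored."""
    project_dir = os.path.join(output_dir, project_name)
    contigs_dir = os.path.join(project_dir, "contigs")
    return {
        # Directory holding the corrected *.fastq.gz reads.
        "corrected_dir": os.path.join(project_dir, "corrected"),
        # Resulting contigs, named after the project.
        "contigs": os.path.join(contigs_dir, project_name + ".fasta"),
        # Symlink present only when --generate-sam-file was given.
        "sam": os.path.join(contigs_dir, project_name + ".sam"),
    }
```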

If the user runs SPAdes with the same project_name more than once, SPAdes finds the corrected reads and asks the user whether it should run error correction again. By default, error correction is not re-run, and the assembly starts after 10 seconds unless the user says otherwise.

There are two ways to run SPAdes on your dataset: you can specify the parameters on the command line or provide a configuration file as the only ./spades.py parameter.

2.3 SPAdes command line options

To run SPAdes from the command line, type


    ./spades.py [options] -n <project_name>

To run SPAdes on the toy dataset (see section 1.3), type


    ./spades.py -1 test_dataset/ecoli_1K_1.fq.gz -2 test_dataset/ecoli_1K_2.fq.gz -n ECOLI_1K
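
When SPAdes is launched from another script, it can be convenient to build the argument list in code. The sketch below uses only options documented in this section; the function build_spades_command and its parameter names are hypothetical.

```python
def build_spades_command(project_name, left, right, single_cell=False,
                         threads=None, kmers=None, script="./spades.py"):
    """Return an argument list for a paired-end SPAdes run."""
    cmd = [script, "-1", left, "-2", right, "-n", project_name]
    if single_cell:
        cmd.append("--sc")  # flag for MDA (single-cell) data
    if threads is not None:
        cmd += ["-t", str(threads)]
    if kmers:
        # -k takes a comma-separated list such as 21,33,55
        cmd += ["-k", ",".join(str(k) for k in kmers)]
    return cmd
```

The returned list can be passed to subprocess.check_call from the extracted SPAdes directory. Passing a list rather than a single shell string avoids quoting problems with unusual file names.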

Here is the description of options:

-n <project_name>
    A required option that sets the name of the project.

-o <output_dir>
    Specifies the output directory. Default: spades_output.

--sc
    This flag is required for MDA (single-cell) data.

--12 <filename>
    File with merged left and right paired-end reads.

-1 <filename>
    File with left paired-end reads.

-2 <filename>
    File with right paired-end reads.

-s <filename>
    File with unpaired reads.

--generate-sam-file
    Generates a SAM file. See section 3 for details.

-t <int> (or --threads <int>)
    Number of threads. The default value is 16.

-m <int> (or --memory <int>)
    Sets the memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. The physical memory actually consumed will be smaller than this limit.

--tmp-dir <dirname>
    Sets the directory to store temporary files from error correction. The default value is <output_dir>/<project_name>/corrected/tmp.

-k <int,int,...>
    Comma-separated list of k-mer sizes to be used (all values must be odd, less than 100, and listed in ascending order). The default is 21,33,55.

-i <int> (or --iterations <int>)
    Number of iterations for error correction. The default value is 1.

--phred-offset <33 or 64>
    PHRED quality offset for the input reads, can be either 33 or 64. It will be auto-detected if it is not specified.

--only-error-correction
    Runs error correction only.

--only-assembler
    Runs assembly only.

--disable-gzip-output
    Forces error correction not to compress the corrected reads.

--test
    Runs SPAdes on the toy dataset; see section 1.3.

--debug
    Runs SPAdes in debug mode, keeping intermediate results.

-h (or --help)
    Prints help.
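
The constraints on the -k option can be checked before launching a run. The following is a sketch, not the validation SPAdes itself performs; it assumes that "ascending" means strictly increasing and that sizes must be positive. The function name parse_kmer_list is hypothetical.

```python
def parse_kmer_list(text):
    """Parse a value for -k and check the constraints stated in this manual."""
    try:
        kmers = [int(part) for part in text.split(",")]
    except ValueError:
        raise ValueError("k-mer sizes must be integers: %r" % text)
    for k in kmers:
        if k <= 0 or k % 2 == 0:
            raise ValueError("k-mer size %d must be a positive odd number" % k)
        if k >= 100:
            raise ValueError("k-mer size %d must be less than 100" % k)
    # Compare each value with its successor to enforce ascending order.
    if any(a >= b for a, b in zip(kmers, kmers[1:])):
        raise ValueError("k-mer sizes must be listed in ascending order")
    return kmers
```

For the default value, parse_kmer_list("21,33,55") returns [21, 33, 55]; a value such as "33,21" raises ValueError.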

2.4 Preparing configuration files

You can find the toy dataset configuration file spades_config.info (see section 1.3) in the directory containing the extracted archive (see section 1.1). This file can be used as a template for your datasets. On each line, text placed after the first semicolon is a comment.

The configuration file starts with the common parameters for the SPAdes run. We recommend setting the project_name parameter that specifies the directory and names for output files. Next is a dataset section that contains input reads information. This section is required and you need to specify either paired_reads or single_reads or both.

The next two sections are error_correction and assembly. If you want to skip one of these stages, you can remove, rename, or comment out the corresponding section.
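
The comment rule for configuration files can be illustrated as follows. This sketch assumes the rule is applied line by line; it does not parse the sections themselves, and the function names are hypothetical.

```python
def strip_comment(line):
    """Drop everything from the first semicolon onward and trim trailing whitespace."""
    return line.split(";", 1)[0].rstrip()


def meaningful_lines(text):
    """Yield the non-empty, comment-free lines of a configuration file."""
    for raw in text.splitlines():
        line = strip_comment(raw)
        if line.strip():
            yield line
```

For a hypothetical line such as "project_name ECOLI_1K ; name of the run", strip_comment returns "project_name ECOLI_1K".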

2.5 Supporting files and directories

For each run, SPAdes creates the directory spades_<date_time> in spades_output/<project_name>. It contains internal configs, logs, and intermediate results for different values of k. The symlinks latest and latest_success are also created; they point to the directories of the latest run and the latest successful run, respectively.
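
A script can follow the latest_success symlink to find the most recent usable run. The sketch below assumes the symlink is created inside <output_dir>/<project_name>, which this manual does not state explicitly; the function name is hypothetical.

```python
import os


def latest_successful_run(project_name, output_dir="spades_output"):
    """Return the directory latest_success points to, or None if there is none."""
    link = os.path.join(output_dir, project_name, "latest_success")
    # os.path.exists is False for a missing link and for a dangling one.
    if not os.path.exists(link):
        return None
    return os.path.realpath(link)
```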

3 Postprocessing

After running SPAdes you can evaluate different quality metrics. Please see quality.html for details.

SPAdes generates a SAM file for further processing if the --generate-sam-file option is set on the command line. The SAM file contains information about the alignment of the original reads to the resulting contigs. We recommend using Tablet [3] to visualize these alignments.
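
As a small illustration of working with this file, the sketch below counts alignment records per contig. It relies only on the SAM format itself (tab-separated fields, header lines starting with @, flag bit 0x4 marking an unmapped read); the function name reads_per_contig is hypothetical.

```python
from collections import Counter


def reads_per_contig(sam_lines):
    """Count mapped alignment records per reference name in SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # read is unmapped
        counts[rname] += 1
    return counts
```

The function accepts any iterable of lines, so an open file object can be passed directly. Contigs with very few records may deserve a closer look in a viewer.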

We also recommend using SEQuel [4] after SPAdes to further reduce the number of small errors (single nucleotide substitutions and small indels). SEQuel also requires the SAM file as input.

Publicly available tools (for example, Opera [2]) can be used to build scaffolds, as the current version of SPAdes does not yet provide this option.

4 Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve SPAdes.

If you have trouble running SPAdes, please send us the following files from the <output_dir>/<project_name> directory:


    params.txt
    corrected/correction.log
    contigs/assembly.log
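
Before writing to us, you may want to confirm which of these files exist. The following sketch does that for a given project directory; the function name find_report_files is hypothetical, and the relative paths are the ones listed above.

```python
import os

# Relative paths listed in this section of the manual.
REPORT_FILES = (
    "params.txt",
    os.path.join("corrected", "correction.log"),
    os.path.join("contigs", "assembly.log"),
)


def find_report_files(project_dir):
    """Return (found, missing) lists of report file paths under project_dir."""
    found, missing = [], []
    for relative in REPORT_FILES:
        path = os.path.join(project_dir, relative)
        (found if os.path.isfile(path) else missing).append(path)
    return found, missing
```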

Address for communications: spades.support@bioinf.spbau.ru.

References

[1] A. Bankevich, S. Nurk, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. May 2012, 19(5): 455-477. doi:10.1089/cmb.2012.0021.

[2] Song Gao, Wing-Kin Sung, and Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology. November 2011, 18(11): 1681-1691. doi:10.1089/cmb.2011.0170.

[3] I. Milne, M. Bayer, L. Cardle, P. Shaw, G. Stephen, F. Wright, and D. Marshall. Tablet — next generation sequence assembly visualization. Bioinformatics (2010) 26 (3): 401-402. doi:10.1093/bioinformatics/btp666.

[4] R. Ronen, C. Boucher, H. Chitsaz, and P. Pevzner. SEQuel: Improving the accuracy of genome assemblies. Bioinformatics (2012) 28 (12): i188-i196. doi:10.1093/bioinformatics/bts219.