SPAdes stands for St. Petersburg genome assembler. It is intended for both standard (multicell) and single-cell MDA bacterial assemblies. This manual will help you install and run SPAdes. You can find the latest SPAdes release at http://bioinf.spbau.ru/spades. The latest version of this manual can be found here. SPAdes version 2.2.1 was released under GPLv2 on 20 August 2012.
SPAdes requires a 64-bit Linux system. We have two test datasets: a single-cell E. coli dataset and a multicell E. coli dataset. SPAdes requires 35 Gb of RAM to process these datasets. SPAdes also requires Python 2.4 or higher.
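The platform requirements above can be checked before installing. The following is a minimal sketch, not part of SPAdes: the function name `meets_requirements` is hypothetical, the 64-bit test inspects the running Python interpreter rather than the operating system itself, and the sketch assumes a Python recent enough to provide `sys.maxsize`. It does not check available RAM.

```python
import platform
import sys


def meets_requirements(system, is_64bit, python_version, minimum=(2, 4)):
    """Return a list of unmet requirements (empty if all are satisfied)."""
    problems = []
    if system != "Linux":
        problems.append("Linux is required, found %s" % system)
    if not is_64bit:
        problems.append("a 64-bit system is required")
    # Compare only (major, minor) against the documented minimum version.
    if tuple(python_version[:2]) < minimum:
        problems.append("Python %d.%d or higher is required" % minimum)
    return problems


if __name__ == "__main__":
    issues = meets_requirements(
        platform.system(),
        sys.maxsize > 2 ** 32,  # True for a 64-bit interpreter
        sys.version_info,
    )
    print("OK" if not issues else "\n".join(issues))
```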
To download SPAdes tar.gz and extract it:
wget http://spades.bioinf.spbau.ru/release2.2.1/spades-2.2.1.tar.gz
tar -xzf spades-2.2.1.tar.gz
cd spades-2.2.1
There are two ways to obtain SPAdes binaries: download static builds from our server or compile SPAdes on your server.
We recommend downloading the binaries with the following script:
Also you can compile SPAdes yourself, but SPAdes depends on the following libraries for compiling its code:
If you meet these requirements you can build SPAdes with the following script:
In both cases you should get a
bin directory containing the files
hammer (error correction module) and
spades (assembly module).
For testing purposes, SPAdes comes with a toy dataset (first 1000 bp of E. coli). If you run
spades.py with the parameter
it will process this dataset and if the installation is successful you will see something like this at the end of the log:
* Corrected reads are in spades_output/ECOLI_1K/corrected/
* Assembled contigs are spades_output/ECOLI_1K/spades_04.18_17.59.30/ECOLI_1K.fasta
Thank you for using SPAdes!
======= SPAdes pipeline finished
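To check a run from a script rather than by eye, the end of the log can be searched for the messages shown above. This is a minimal sketch that assumes the log wording matches the example in this manual; the function name `summarize_log` is hypothetical and other SPAdes versions may phrase these lines differently.

```python
import re


def summarize_log(log_text):
    """Extract the completion marker and output paths from SPAdes log text."""
    corrected = re.search(r"Corrected reads are in (\S+)", log_text)
    contigs = re.search(r"Assembled contigs are (\S+)", log_text)
    return {
        # True only if the final "pipeline finished" message is present.
        "finished": "SPAdes pipeline finished" in log_text,
        "corrected_reads": corrected.group(1) if corrected else None,
        "contigs": contigs.group(1) if contigs else None,
    }
```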
SPAdes accepts single reads as well as forward-reverse paired-end reads in FASTA and FASTQ format; however, in order to run error correction, reads should be in FASTQ format. All files may be compressed with gzip. At present, SPAdes can accept only one paired-end library as an input.
SPAdes supports paired-end reads organized in two separate files or combined in one:
s_1_2_sequence.txt for the forward and reverse reads, respectively, from lane 1), the second file should have corresponding reads in the exact same order.
s_1_1_sequence.txt is followed by the corresponding read from
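The two layouts described above can be converted into one another. The sketch below merges two FASTQ inputs into the combined layout, where each forward read is followed by its mate. It is an illustration, not a SPAdes utility: the function names are hypothetical, and it assumes plain four-line FASTQ records with no wrapped sequences and no trailing blank lines.

```python
from itertools import islice, zip_longest


def fastq_records(lines):
    """Yield FASTQ records as lists of 4 lines (header, sequence, plus, quality)."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("incomplete FASTQ record: %r" % record)
        yield record


def interleave(forward_lines, reverse_lines):
    """Return the lines of a combined file: each forward read, then its mate."""
    merged = []
    pairs = zip_longest(fastq_records(forward_lines), fastq_records(reverse_lines))
    for forward, reverse in pairs:
        # Both inputs must list the same reads in the same order.
        if forward is None or reverse is None:
            raise ValueError("the two inputs contain different numbers of reads")
        merged.extend(forward)
        merged.extend(reverse)
    return merged
```

Any iterable of lines works as input, so gzip-compressed files can be passed in as `gzip.open(path, "rt")` objects.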
SPAdes automatically creates a
spades_output directory where it stores all of the resulting files. For each dataset SPAdes creates
a separate directory
Before starting assembly SPAdes initiates BayesHammer to correct errors in reads.
After that the corrected reads are stored in the directory
After assembly completion, the resulting contigs are stored in the directory
If the --generate-sam-file option was used, a symlink to
<project_name>.sam will also be created in the same directory.
If the user runs SPAdes with the same
project_name more than once, SPAdes finds the corrected reads and asks the user whether
it should start error correction again. By default, SPAdes does not re-run error correction and starts the assembly after 10 seconds
unless the user says otherwise.
There are two ways to run SPAdes for your dataset: you can specify the parameters from the command line or provide the configuration
file as the only
To run SPAdes from the command line, type
./spades.py [options] -n <project_name>
To run SPAdes on the toy dataset (see section 1.3), type
./spades.py -1 test_dataset/ecoli_1K_1.fq.gz -2 test_dataset/ecoli_1K_2.fq.gz -n ECOLI_1K
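When SPAdes is launched from another script, the same command line can be assembled as an argument list. The sketch below uses only flags that appear in this manual (-1, -2, -t, -m, -n); the helper name `build_spades_command` and its keyword arguments are hypothetical and not part of SPAdes.

```python
def build_spades_command(project_name, left=None, right=None,
                         threads=None, memory_gb=None, script="./spades.py"):
    """Build an argument list for spades.py without running it."""
    if not project_name:
        raise ValueError("a project name is required (-n)")
    if (left is None) != (right is None):
        raise ValueError("left and right read files must be given together")
    command = [script]
    if left is not None:
        command += ["-1", left, "-2", right]
    if threads is not None:
        command += ["-t", str(threads)]
    if memory_gb is not None:
        command += ["-m", str(memory_gb)]
    command += ["-n", project_name]
    return command
```

The resulting list can be passed to `subprocess.call(command)` from the directory that contains spades.py.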
Here is the description of options:
A required option that sets the name of the project.
Specify the output directory. Default:
This flag is required for MDA (single-cell) data.
File with merged left and right paired end reads.
File with left paired end reads.
File with right paired end reads.
File with unpaired reads.
Generate a SAM file. See more details in section 3.
-t <int> (or
Number of threads. The default value is 16.
-m <int> (or
Sets the memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. The physical memory actually consumed will be smaller than this limit.
Sets the directory to store temporary files from error correction. The default value is
Comma-separated list of k-mer sizes to be used (all values must be odd, less than 100, and listed in ascending order). Default is 21,33,55.
-i <int> (or
Number of iterations for error correction. The default value is 1.
--phred-offset <33 or 64>
PHRED quality offset for the input reads, can be either 33 or 64. It will be auto-detected if it is not specified.
Runs error correction only.
Runs assembly only.
Forces error correction not to compress the corrected reads.
Runs SPAdes on the toy dataset, see section 1.3.
Runs SPAdes in debug mode, keeping intermediate results.
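The constraints on the k-mer list in the option descriptions above (odd values, less than 100, ascending order) are easy to check before starting a long run. This is a minimal sketch with a hypothetical function name; it treats "ascending" as strictly ascending, which is an assumption, and it is not how SPAdes itself validates the option.

```python
def parse_kmer_sizes(text="21,33,55"):
    """Parse a comma-separated k-mer list and check the documented constraints."""
    sizes = [int(part) for part in text.split(",")]
    for k in sizes:
        if k <= 0:
            raise ValueError("k-mer size must be positive: %d" % k)
        if k % 2 == 0:
            raise ValueError("k-mer size must be odd: %d" % k)
        if k >= 100:
            raise ValueError("k-mer size must be less than 100: %d" % k)
    # Each value must be larger than the one before it (assumed strict).
    for previous, current in zip(sizes, sizes[1:]):
        if current <= previous:
            raise ValueError("k-mer sizes must be in ascending order")
    return sizes
```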
You can find the toy dataset configuration file
spades_config.info (section 1.3)
in the directory containing extracted archive (see section 1.1). This file can be used as the template for your datasets.
Text placed after the first semicolon in this file is a comment.
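The comment rule can be illustrated with a few lines of Python. This sketch assumes the rule applies per line and does not handle semicolons inside quoted values; the function names and the sample line in the usage note are illustrative only and do not describe the full configuration format.

```python
def strip_comment(line):
    """Drop everything from the first semicolon onward, plus trailing spaces."""
    return line.split(";", 1)[0].rstrip()


def meaningful_lines(text):
    """Return the non-empty, comment-free lines of a configuration text."""
    stripped = (strip_comment(line) for line in text.splitlines())
    return [line for line in stripped if line.strip()]
```

For example, `strip_comment("project_name ECOLI_1K ; toy dataset")` returns `"project_name ECOLI_1K"`.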
The configuration file starts with the common parameters for the SPAdes run. We recommend setting the
project_name parameter that specifies the directory
and names for output files. Next is a dataset section that contains input reads information. This section is required and you need to specify
single_reads or both.
The next two sections are
assembly. If you want to skip one of these stages, you can
remove it, rename it, or comment it out.
For each run SPAdes creates the directory
It contains internal configs, logs and intermediate results for different values of k.
latest_success are created for the directories
for the latest and the latest successful runs.
After running SPAdes you can evaluate different quality metrics. Please see quality.html for details.
SPAdes generates a SAM file for further processing if the
--generate-sam-file option is set on the command line.
The SAM file contains information about the alignment of the original reads to the resulting
contigs. We recommend using Tablet to visualize these alignments.
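Besides visual inspection, the SAM file can be summarized with a short script. The sketch below counts aligned reads per contig and relies only on the standard SAM layout (tab-separated fields, header lines starting with `@`, FLAG in column 2, reference name in column 3, flag bit 0x4 meaning unmapped). The function name is hypothetical and nothing here is specific to SPAdes.

```python
from collections import Counter


def reads_per_contig(sam_lines):
    """Count mapped reads for each reference (contig) name in SAM text."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines such as @HD, @SQ, @PG.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        contig = fields[2]
        # Bit 0x4 marks an unmapped read; "*" means no reference name.
        if flag & 0x4 or contig == "*":
            continue
        counts[contig] += 1
    return counts
```

A file can be passed directly, for example `reads_per_contig(open(path))`.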
We also recommend using SEQuel after SPAdes to further reduce the number of small errors (single nucleotide substitutions and small indels). SEQuel also requires the SAM file as an input.
Publicly available tools (for example, Opera) can be used to build scaffolds, because the current version of SPAdes does not yet provide this option.
Your comments, bug reports, and suggestions are very welcome. They will help us to further improve SPAdes.
If you have trouble running SPAdes, please provide us with the following files from
params.txt
corrected/correction.log
contigs/assembly.log
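Before sending a report, it can help to confirm which of these files exist. The sketch below is an illustration only: the function name is hypothetical, and `run_dir` stands for whichever directory holds the files for your run, which the caller must supply.

```python
import os

# Relative paths of the files requested for bug reports.
REPORT_FILES = [
    "params.txt",
    os.path.join("corrected", "correction.log"),
    os.path.join("contigs", "assembly.log"),
]


def find_report_files(run_dir):
    """Split the requested report files into those found and those missing."""
    found, missing = [], []
    for relative in REPORT_FILES:
        path = os.path.join(run_dir, relative)
        if os.path.isfile(path):
            found.append(path)
        else:
            missing.append(path)
    return found, missing
```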
Address for communications: firstname.lastname@example.org.
 A. Bankevich, S. Nurk, et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. May 2012, 19(5): 455-477. doi:10.1089/cmb.2012.0021.
 Song Gao, Wing-Kin Sung, and Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology. November 2011, 18(11): 1681-1691. doi:10.1089/cmb.2011.0170.
 I. Milne, M. Bayer, L. Cardle, P. Shaw, G. Stephen, F.Wright, and D. Marshall. Tablet — next generation sequence assembly visualization. Bioinformatics (2010) 26 (3): 401-402. doi: 10.1093/bioinformatics/btp666
 R. Ronen, C. Boucher, H. Chitsaz, and P. Pevzner. SEQuel: Improving the accuracy of genome assemblies. Bioinformatics (2012) 28 (12): i188-i196. doi: 10.1093/bioinformatics/bts219