truSPAdes 1.0 Manual

1. About truSPAdes
    1.1. TruSeq data
    1.2. TruSPAdes performance
2. Installation
    2.1. Downloading truSPAdes Linux binaries
    2.2. Downloading truSPAdes binaries for Mac
    2.3. Downloading and compiling truSPAdes source code
    2.4. Verifying your installation
3. Running truSPAdes
    3.1. TruSPAdes command line options
    3.2. TruSPAdes output
    3.3. Assembly evaluation
4. Feedback and bug reports

1. About truSPAdes

TruSPAdes is an assembler for short reads produced by Illumina TruSeq technology. TruSPAdes takes as input the reads from a single TruSeq barcode and assembles them into long virtual reads. This manual will help you install and run truSPAdes. TruSPAdes version 1.0 was released under GPLv2 on January 16, 2015 and can be downloaded from http://bioinf.spbau.ru/en/truspades. TruSPAdes is a modification of the SPAdes genome assembler.

1.1 TruSeq data

TruSeq Synthetic Long Reads technology is based on fragmenting genomic DNA into large segments (about 10 kb long) and forming random pools of the resulting segments (each pool contains about 300 segments). Next, these fragments are clonally amplified, sheared, and marked with a unique barcode. Afterwards, they are sequenced using the standard Illumina short reads technology. All short reads originating from the same barcode are assembled together, resulting in a set of long contigs (this step is called TruSeq barcode assembly). Ideally, sequencing a single barcode yields approximately 300 long virtual reads, one for each of the roughly 300 genomic fragments (each about 10 kb long) in the pool. Together, these segments are expected to cover about 3 million nucleotides (barcode span). TruSPAdes is a tool for barcode assembly, so it should be run separately on the reads from each barcode.
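
The barcode span quoted above follows directly from the pool size and the fragment length. The short sketch below only reproduces that arithmetic; the function name and its default values are illustrative and are not part of truSPAdes.

```python
def barcode_span(fragments_per_pool=300, fragment_length=10000):
    """Expected number of nucleotides covered by one barcode.

    Defaults are the approximate figures from this section:
    about 300 fragments per pool, each about 10 kb long.
    """
    return fragments_per_pool * fragment_length


span = barcode_span()
print(span)
```

With the default values this gives 300 x 10000 = 3000000 nucleotides, which is the "about 3 million" figure given above.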

1.2 TruSPAdes performance

TruSPAdes assembles an average barcode (300000 reads) in 1 hour using one thread of an Intel Xeon 2.27GHz processor and requires less than 2Gb of RAM. TruSPAdes can be run with several threads to reduce running time at the cost of increased memory consumption, but we do not recommend it. Since truSPAdes should be run independently on many barcodes, we suggest running several instances of truSPAdes (on different barcodes) simultaneously, each in one thread (or even on different nodes of a cluster). This way a standard TruSeq dataset with 394 barcodes can be processed in 24 hours using 16 threads and 32Gb of RAM. Normally truSPAdes assembles at least 2Mb of virtual long reads from a single barcode, with N50 at least 7500.
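
One possible way to follow this suggestion is a small driver script that keeps a fixed number of single-threaded truSPAdes instances running at once. The sketch below is not shipped with truSPAdes; the function names and the way barcodes are passed in (as triples of left reads, right reads and output directory) are assumptions made for illustration. It uses only the documented -1, -2, -t and -o options and needs Python 3.2 or later for concurrent.futures.

```python
import math
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(left, right, out_dir):
    """Command line for one single-threaded truSPAdes run on one barcode."""
    return ["truspades.py", "-1", left, "-2", right, "-t", "1", "-o", out_dir]


def estimate_wall_hours(n_barcodes, workers, hours_per_barcode=1.0):
    """Rough estimate: barcodes are processed in waves of `workers` runs."""
    return math.ceil(n_barcodes / workers) * hours_per_barcode


def run_all(barcodes, workers=16):
    """Run truSPAdes on every (left, right, out_dir) triple, `workers` at a time.

    Returns the exit codes of the runs in input order.
    """
    def run_one(triple):
        return subprocess.call(build_command(*triple))

    # The threads only wait for child processes, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, barcodes))
```

Under the figures above (about 1 hour per barcode), estimate_wall_hours(394, 16) gives 25 hours, in line with the roughly 24 hours quoted for a standard dataset. A call such as run_all([("bc001_1.fastq", "bc001_2.fastq", "bc001_out")]) would start the actual runs; the file names here are hypothetical.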

2. Installation

TruSPAdes requires a 64-bit Linux system or Mac OS with Python pre-installed (supported versions are 2.4, 2.5, 2.6, 2.7, 3.2 and 3.3). To obtain truSPAdes you can either download binaries or download the source code and compile it yourself.

2.1 Downloading truSPAdes Linux binaries

To download truSPAdes Linux binaries and extract them, go to the directory in which you wish truSPAdes to be installed and run:


    wget http://spades.bioinf.spbau.ru/truSPAdes-1.0/truSPAdes-1.0-Linux.tar.gz
    tar -xzf truSPAdes-1.0-Linux.tar.gz
    cd truSPAdes-1.0-Linux/bin/

In this case you do not need to run any installation scripts – truSPAdes is ready to use. The following files will be placed in the bin directory:

We also suggest adding the truSPAdes installation directory to the PATH variable.

2.2 Downloading truSPAdes binaries for Mac

To download truSPAdes binaries for Mac, go to the directory in which you wish truSPAdes to be installed and run:


    curl http://spades.bioinf.spbau.ru/truSPAdes-1.0/truSPAdes-1.0-Darwin.tar.gz -o truSPAdes-1.0-Darwin.tar.gz
    tar -zxf truSPAdes-1.0-Darwin.tar.gz
    cd truSPAdes-1.0-Darwin/bin/

Just as in Linux, truSPAdes is ready to use and no further installation steps are required. You will get the same files in the bin directory:

We also suggest adding the truSPAdes installation directory to the PATH variable.

2.3 Downloading and compiling truSPAdes source code

If you wish to compile truSPAdes by yourself you will need the following libraries to be pre-installed:

If you meet these requirements, you can download the truSPAdes source code:


    wget http://spades.bioinf.spbau.ru/truSPAdes-1.0/truSPAdes-1.0.tar.gz
    tar -xzf truSPAdes-1.0.tar.gz
    cd truSPAdes-1.0

and build it with the following script:


    ./truspades_compile.sh

TruSPAdes will be built in the directory ./bin. If you wish to install truSPAdes into another directory, you can specify the full path of the destination folder by running the following command in bash or sh:


    PREFIX=<destination_dir> ./truspades_compile.sh

for example:


    PREFIX=/usr/local ./truspades_compile.sh

which will install truSPAdes into /usr/local/bin.

After installation you will get the same files in ./bin (or <destination_dir>/bin if you specified PREFIX) directory:

We also suggest adding the truSPAdes installation directory to the PATH variable.

2.4 Verifying your installation

For testing purposes, truSPAdes comes with a toy data set. To try truSPAdes on this data set, run:


    <truspades installation dir>/truspades.py --test

If you added the truSPAdes installation directory to the PATH variable, you can run:


    truspades.py --test

For simplicity, we assume from here on that the truSPAdes installation directory has been added to the PATH variable.

If the installation is successful, you will find the following information at the end of the log:



     * Assembled long truseq reads are in truspades_test/truseq_long_reads.fasta
    ======= truSPAdes pipeline finished.
    TruSPAdes log can be found here: <your folder>/spades_test/truspades.log
    Thank you for using truSPAdes!
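
If you run the test from a script, the same check can be automated by looking for the closing message in the last lines of the log. The helper below is a minimal sketch and is not part of truSPAdes; the marker string is taken from the log excerpt above, while the function name and the number of lines inspected are arbitrary choices.

```python
def log_reports_success(log_text, tail_lines=10):
    """Return True if the last lines of the log contain the closing message."""
    tail = log_text.strip().splitlines()[-tail_lines:]
    return any("truSPAdes pipeline finished" in line for line in tail)
```

For example, reading truspades.log into a string and passing it to log_reports_success should return True after a successful test run.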


3. Running truSPAdes

3.1 TruSPAdes command line options

To run truSPAdes from the command line, type


    truspades.py [options] -1 <left_reads_file> -2 <right_reads_file> -o <output_dir>

Note that we assume that the truSPAdes installation directory has been added to the PATH variable (otherwise, provide the full path to the truSPAdes executable: <truspades installation dir>/truspades.py).

Basic options

-o <output_dir>
    Specify the output directory. Required option.

--test
    Runs truSPAdes on the toy data set; see section 2.4.

-h (or --help)
    Prints help.

--continue
    Continues a truSPAdes run from the specified output folder starting from the last available check-point. Check-points are made after:

For example, if the specified K values are 21, 33 and 55 and truSPAdes was stopped or crashed during the assembly stage with K = 55, you can run truSPAdes with the --continue option specifying the same output directory. TruSPAdes will continue the run starting from the assembly stage with K = 55. The error correction module and the iterations for K equal to 21 and 33 will not be run again. Note that all options except -o <output_dir> are ignored if --continue is set.

--restart-from <check_point>
    Restarts a truSPAdes run from the specified output folder starting from the specified check-point. Check-points are:

In contrast to the --continue option, --restart-from lets you change some of the options. You can change any option except the basic options. For example, if you ran the assembler with k values 21,33,45,55, you can add one more iteration with k=77 using the following options:
--restart-from k55 -k 21,33,45,55,77 -o <previous_output_dir>.
Since all files will be overwritten, do not forget to copy your assembly from the previous run if you need it.
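
When many barcodes are processed by a script, the --continue option described above can be used to resume interrupted runs automatically. The sketch below is only a heuristic and is not part of truSPAdes: it assumes that an output directory containing truspades.log but no truseq_long_reads.fasta belongs to an unfinished run. The function name is hypothetical.

```python
import os


def next_command(left, right, out_dir):
    """Choose the truSPAdes command for one barcode, or None if nothing is left to do.

    Assumption (not stated in this manual): a log file without the final
    truseq_long_reads.fasta means that the previous run did not finish.
    """
    if os.path.exists(os.path.join(out_dir, "truseq_long_reads.fasta")):
        return None  # long reads are already there
    if os.path.exists(os.path.join(out_dir, "truspades.log")):
        # All options except -o are ignored with --continue, so none are passed.
        return ["truspades.py", "--continue", "-o", out_dir]
    return ["truspades.py", "-1", left, "-2", right, "-o", out_dir]
```

The returned list can be passed to subprocess.call. Whether a partial output directory is safe to resume should still be checked against truspades.log.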

Input data

-1 <file_name>
    File with forward reads.

-2 <file_name>
    File with reverse reads.

Advanced options

-t <int> (or --threads <int>)
    Number of threads. The default value is 1.

-m <int> (or --memory <int>)
    Set the memory limit in Gb. TruSPAdes terminates if it reaches this limit. The default value is 250 Gb. The actual amount of consumed RAM will be below this limit. Make sure this value is correct for the given machine. TruSPAdes uses the limit value to automatically determine the sizes of various buffers, etc.

--tmp-dir <dir_name>
    Set directory for temporary files from read error correction. The default value is <output_dir>/corrected/tmp

-k <int,int,...>
    Comma-separated list of k-mer sizes to be used (all values must be odd, less than 128 and listed in ascending order). The default value depends on the read length. E.g., for reads of length 100, the values 21,33,45,55 are used.
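
The constraints on the -k list can be checked before a long batch of runs is started. The function below is a sketch of such a check, not code from truSPAdes; it treats repeated values as an error, which is our reading of "ascending order".

```python
def parse_k_list(text):
    """Parse a comma-separated k-mer list and check the documented constraints."""
    values = [int(part) for part in text.split(",")]
    for k in values:
        if k % 2 == 0:
            raise ValueError("k must be odd: %d" % k)
        if k >= 128:
            raise ValueError("k must be less than 128: %d" % k)
    # Assumption: "ascending" is taken to mean strictly increasing.
    if any(a >= b for a, b in zip(values, values[1:])):
        raise ValueError("k values must be listed in ascending order")
    return values
```

For instance, parse_k_list("21,33,45,55") returns the list of integers, while "21,32" (even value), "21,129" (too large) and "33,21" (wrong order) raise ValueError.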

Examples

To test the toy data set, you can also run the following command from the truSPAdes bin directory:


    truspades.py -1 ../share/truspades/test_dataset/ecoli_1K_1.fq.gz \
    -2 ../share/truspades/test_dataset/ecoli_1K_2.fq.gz -o truspades_test

If you want to use specific values of k:


    truspades.py -1 left_reads.fastq -2 right_reads.fastq -k 21,31,41,51,61 -o output_folder

3.2 TruSPAdes output

TruSPAdes stores all output files in <output_dir>, which is set by the user. The resulting TruSeq long reads will be stored in <output_dir>/truseq_long_reads.fasta. The full list of <output_dir> contents is presented below:

    truseq_long_reads.fasta    resulting TruSeq long reads
    params.txt                 information about truSPAdes parameters in this run
    truspades.log              truSPAdes log
    dataset.info               internal configuration file
    input_dataset.yaml         internal YAML data set file
    K<##>/                     directory containing files from the run with K=<##>
    SCC                        directory containing files from the scaffold correction stage
    alignment/                 directory containing alignment files used in postprocessing stage
    misc                       directory containing scaffolds before postprocessing
    corrected                  directory containing internal information

TruSPAdes will overwrite these files and directories if they exist in the specified <output_dir>.

3.3 Assembly evaluation

QUAST may be used to generate summary statistics (N50, maximum contig length, GC %, # genes found in a reference list or with built-in gene finding tools, etc.) for a single assembly. It may also be used to compare statistics for multiple assemblies of the same data set (e.g., truSPAdes run with different parameters, or several different assemblers).
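
For a quick sanity check of a single barcode without QUAST, the total length and N50 of truseq_long_reads.fasta can be computed with a few lines of Python. The sketch below is independent of both QUAST and truSPAdes; it uses the common definition of N50 (the length of the shortest sequence among the longest sequences that together cover at least half of the total length), and the function names are ours.

```python
def read_fasta_lengths(lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

Passing an open file handle for <output_dir>/truseq_long_reads.fasta to read_fasta_lengths gives the list of long read lengths; their sum and n50 can then be compared with the typical values given in section 1.2 (at least 2Mb in total, N50 at least 7500).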

4. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve truSPAdes.

If you have any trouble running truSPAdes, please send us params.txt and truspades.log from the directory <output_dir>.

Address for communications: spades.support@bioinf.spbau.ru.