Assembly
Setup
Once your reads are clean, you’re ready to assemble. At the moment, you can use velvet, ABySS, and spades for assembly.
Most of the assembly process is automated using code within phyluce, specifically the following 3 scripts:
phyluce_assembly_assemblo_abyss
phyluce_assembly_assemblo_spades
phyluce_assembly_assemblo_velvet
The code of each of the above programs always expects your input directories to have the following structure (from the Quality Control section):
uce-clean/
    genus_species1/
        adapters.fasta
        raw-reads/
            genus_species1-READ1.fastq.gz (symlink)
            genus_species1-READ2.fastq.gz (symlink)
        split-adapter-quality-trimmed/
            genus_species1-READ1.fastq.gz
            genus_species1-READ2.fastq.gz
            genus_species1-READ-singleton.fastq.gz
        stats/
            genus_species1-adapter-contam.txt
Each of these assembly helper programs takes the same configuration file as input. You should format the configuration file according to the following scheme:
[samples]
name_you_want_assembly_to_have:/path/to/uce-clean/genus_species1
In practice, this means you need to create a configuration file that looks like:
[samples]
anas_platyrhynchos1:/path/to/uce-clean/anas_platyrhynchos1
anas_carolinensis1:/path/to/uce-clean/anas_carolinensis1
dendrocygna_bicolor1:/path/to/uce-clean/dendrocygna_bicolor1
The assembly name on the left side of the colon can be whatever you want. The path on the right-hand side of the colon must be a valid path to a directory containing read data laid out as described above.
Attention
Assembly names MUST be unique.
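If you have many samples, writing the configuration file by hand gets tedious. The following is a minimal sketch, not part of phyluce, that builds the [samples] section from one or more uce-clean directories and refuses to continue if two samples would share an assembly name. The function name write_assembly_config and the idea of using each directory name as the assembly name are assumptions made for illustration.

```python
from pathlib import Path


def write_assembly_config(clean_dirs, config_path):
    """Write a [samples] section with one entry per sample directory.

    Hypothetical helper: each immediate subdirectory of every directory in
    clean_dirs is treated as one sample, named after the subdirectory.
    """
    samples = []
    for clean_dir in clean_dirs:
        root = Path(clean_dir).resolve()
        samples += sorted(p for p in root.iterdir() if p.is_dir())
    names = [p.name for p in samples]
    # Assembly names MUST be unique, so stop if any name appears twice.
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"assembly names are not unique: {duplicates}")
    lines = ["[samples]"]
    lines += [f"{p.name}:{p}" for p in samples]
    Path(config_path).write_text("\n".join(lines) + "\n")
    return names
```

Calling write_assembly_config(["/path/to/uce-clean"], "assembly.conf") would write one name:path line per sample directory, using absolute paths.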
Question: How do I name my samples/assemblies?
Naming samples is a contentious issue and is also a hard thing to deal with using computer code. You should never have a problem if you name your samples as follows, where the genus and specific epithet are separated by an underscore, and multiple individuals of a given species are indicated using a trailing integer value:
anas_platyrhynchos1
anas_carolinensis1
dendrocygna_bicolor1
You should also not have problems if you use a naming scheme that suffixes the species binomial(s) with an accession number that is simply formatted (e.g. no slashes, dashes, etc.):
anas_platyrhynchos_KGH2267
anas_carolinensis_KGH2269
dendrocygna_bicolor_DWF4597
The above is the recommended working format. When you search for UCE contigs, phyluce should screen your taxon names to ensure they do not contain restricted characters (any of .+:"'-?!*@%^&#=/\) and do not begin with a number. It’s probably best to get that all squared away now.
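To catch naming problems before you assemble anything, you can check your names yourself. The sketch below is an assumption about how such a check could look, not phyluce's own screening code: it flags the restricted characters listed above and names that begin with a number. The function name check_name is hypothetical.

```python
import re

# Characters listed above as restricted: . + : " ' - ? ! * @ % ^ & # = / \
RESTRICTED = re.compile(r"""[.+:"'\-?!*@%^&#=/\\]""")


def check_name(name):
    """Return a list of problems found in a sample name (empty if none)."""
    problems = []
    found = sorted(set(RESTRICTED.findall(name)))
    if found:
        problems.append("restricted characters: " + " ".join(found))
    # name[:1] is an empty string for an empty name, so this never raises.
    if name[:1].isdigit():
        problems.append("name begins with a number")
    return problems
```

Names such as anas_platyrhynchos1 or anas_platyrhynchos_KGH2267 return an empty list, while a name containing a dash or a period is reported.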
Running the assembly
Once your configuration file is created (best to use a decent text editor that will not cause you grief), you are ready to start assembling your read data into contigs that we will search for UCEs. The code to do this for the three helper scripts is below.
General process
The general process that the helper scripts use is:
Create the output directory (AKA $ASSEMBLY, below)
Create a contigs folder within the output directory
For each taxon create $ASSEMBLY/genus-species directory, based on config file entries
Find the correct fastq files for a given sample
Input those fastq files to whichever assembly program you are running
Assemble reads
Strip contigs of potentially problematic bases (ABySS-only)
Normalize contig names
Link all assembly files with normalized names in $ASSEMBLY/genus-species/ into $ASSEMBLY/contigs/genus-species.contigs.fasta, so that all assemblies are linked in the same output directory.
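The step that finds the fastq files for a sample relies on the directory layout and file naming shown in the Setup section. As a rough illustration only (this is not the code phyluce runs, and find_reads is a hypothetical name), locating the reads could look like this:

```python
from pathlib import Path


def find_reads(sample_dir, subfolder="split-adapter-quality-trimmed"):
    """Locate READ1, READ2 and singleton fastq files for one sample."""
    reads_dir = Path(sample_dir) / subfolder
    reads = {}
    for key, pattern in (
        ("read1", "*-READ1.fastq.gz"),
        ("read2", "*-READ2.fastq.gz"),
        ("singleton", "*-READ-singleton.fastq.gz"),
    ):
        matches = sorted(reads_dir.glob(pattern))
        # Assumption of this sketch: paired reads are required, singletons optional.
        if not matches and key != "singleton":
            raise FileNotFoundError(f"no file matching {pattern} in {reads_dir}")
        reads[key] = matches[0] if matches else None
    return reads
```

Running find_reads on a sample directory before you start a long assembly is a cheap way to confirm that the path in your configuration file points at data in the expected layout.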
velvet
# make a directory for log files
mkdir log
# run the assembly
phyluce_assembly_assemblo_velvet \
--config config_file_you_created.conf \
--output /path/where/you/want/assemblies \
--kmer 35 \
--subfolder split-adapter-quality-trimmed \
--cores 12 \
--clean \
--log-path log
Results
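The directory structure created for velvet-based assemblies looks like the following (this example comes from a run with a kmer value of 31, so the file names contain k31):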
path-to-output-directory/
    contigs/
        genus-species1 -> ../genus-species1/out_k31/contigs.fa
    genus-species1/
        contigs.fasta -> out_k31/contigs.fa
        out_k31
        velvetg-k31.err.log
        velvetg-k31.out.log
        velveth-k31.err.log
        velveth-k31.out.log
ABySS
# make a directory for log files
mkdir log
# run the assembly
phyluce_assembly_assemblo_abyss \
--config config_file_you_created.conf \
--output /path/where/you/want/assemblies \
--kmer 35 \
--subfolder split-adapter-quality-trimmed \
--cores 12 \
--clean \
--log-path log
Attention
Following assembly, phyluce_assembly_assemblo_abyss modifies the assemblies by replacing degenerate base codes with standard nucleotide encodings. We do this because lastz, which we use to match contigs to targeted UCE loci, is not compatible with degenerate IUPAC codes.
The phyluce_assembly_assemblo_abyss code makes these substitutions for every site having a degenerate code by randomly selecting one of the nucleotides that the code can represent. The code also renames the ABySS assemblies using the velvet naming convention. The modified contigs are then symlinked into $ASSEMBLY/contigs. Unmodified contigs are available in $ASSEMBLY/genus-species/out_k*-contigs.fa.
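To make the substitution concrete, here is a small sketch of random resolution of degenerate bases. It illustrates the idea and is not the phyluce implementation; the function name resolve_degenerate is hypothetical, and whether phyluce treats N or lowercase bases exactly this way is an assumption.

```python
import random

# Standard IUPAC ambiguity codes and the nucleotides each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_degenerate(sequence, rng=random):
    """Replace each degenerate code with one randomly chosen compatible base."""
    resolved = []
    for base in sequence:
        # Assumption: N is resolved like any other ambiguity code.
        choices = IUPAC.get(base.upper())
        resolved.append(rng.choice(choices) if choices else base)
    return "".join(resolved)
```

Because the choice is random, two runs over the same contig can differ at the degenerate sites. Passing a seeded random.Random instance as rng makes the result repeatable.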
Results
The directory structure created for ABySS-based assemblies looks like:
path-to-output-directory/
    contigs/
        genus-species1 -> ../genus-species1/out_k31-contigs-velvet.fa
    genus-species1/
        abyss-k31.err.log
        contigs.fasta -> out_k31-contigs-velvet.fa
        out_k31-contigs.fa
        out_k31-scaffolds.fa
        out_k31-unitigs.fa
        abyss-k31.out.log
        coverage.hist
        out_k31-contigs-velvet.fa
        out_k31-stats
Spades
# make a directory for log files
mkdir log
# run the assembly
phyluce_assembly_assemblo_spades \
--config config_file_you_created.conf \
--output /path/where/you/want/assemblies \
--subfolder split-adapter-quality-trimmed \
--clean \
--cores 12 \
--log-path log
Results
The directory structure created for spades-based assemblies looks like:
path-to-output-directory/
    contigs/
        genus-species1 -> ../genus-species1/scaffolds.fasta
    genus-species1/
        contigs.fasta -> Trinity.fasta
        Trinity.fasta
        trinity.log
Common questions
Question: Which assembly program do I pick?
Generally, I would suggest that you use spades. It produces reasonable contig assemblies that are longer than the assemblies built by velvet, ABySS, or Trinity (now removed from phyluce). It arguably also produces assemblies that are more accurate than those from these other programs.
Question: For ABySS and velvet, what --kmer value do I use?
Also a hard question. Part of the reason it is hard is that we are trying to assemble data of heterogeneous read depth (i.e., our reads are spread across (mostly) UCE loci, but the depth of coverage of each locus is variable due to capture efficiency). Longer kmer values can give you longer (but fewer) contigs, while shorter kmer values produce shorter, more abundant contigs. In most cases, your assemblies will be decent with a kmer value around 55-65.
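If you want to compare a few kmer values from that range, it can help to generate the commands rather than retype them. The sketch below only builds and prints command lines using the options shown in the ABySS section; it does not run anything. The function name abyss_commands and the abyss-k<value> output folder naming are assumptions made for illustration.

```python
def abyss_commands(config, output_root, kmers, cores=12):
    """Build one phyluce_assembly_assemblo_abyss command per kmer value."""
    commands = []
    for kmer in kmers:
        commands.append([
            "phyluce_assembly_assemblo_abyss",
            "--config", config,
            # Hypothetical naming: one output folder per kmer value.
            "--output", f"{output_root}/abyss-k{kmer}",
            "--kmer", str(kmer),
            "--subfolder", "split-adapter-quality-trimmed",
            "--cores", str(cores),
            "--clean",
            "--log-path", "log",
        ])
    return commands


# Print the commands so they can be reviewed before running them by hand.
for command in abyss_commands(
    "config_file_you_created.conf", "/path/where/you/want/assemblies", [55, 60, 65]
):
    print(" ".join(command))
```

Keeping each kmer value in its own output folder means the assemblies can be compared side by side afterwards without one run overwriting another.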