Once your reads are clean, you’re ready to assemble. At the moment, you can use velvet, ABySS, and Trinity for assembly. However, be aware that there is not a conda install package for Trinity due to some difficulties in how that package is structured.
Most of the assembly process is automated using code within phyluce, specifically the following 3 scripts:
The code of each of the above programs always expects your input directories to have the following structure (from the Quality Control section):
uce-clean/ genus_species1/ adapters.fasta raw-reads/ genus_species1-READ1.fastq.gz (symlink) genus_species1-READ2.fastq.gz (symlink) split-adapter-quality-trimmed/ genus_species1-READ1.fastq.gz genus_species1-READ2.fastq.gz genus_species1-READ-singleton.fastq.gz stats/ genus_species1-adapter-contam.txt
And, each of these assembly helper programs take the same configuration files as input. You should format the configuration file for input according to the following scheme:
In practice, this means you need to create a configuration file that looks like:
[samples] anas_platyrhynchos1:/path/to/uce-clean/anas_platyrhynchos1 anas_carolinensis1:/path/to/uce-clean/anas_carolinensis1 dendrocygna_bicolor1:/path/to/uce-clean/dendrocygna_bicolor1
The assembly name on the left side of the colon can be whatever you want. The path name on the right hand side of the colon must be a valid path to a directory containing read data in a format similar to that described above.
Assembly names MUST be unique.
Question: How do I name my samples/assemblies?
Naming samples is a contentious issue and is also a hard thing to deal with using computer code. You should never have a problem if you name your samples as follows, where the genus and specific epithet are separated by an underscore, and multiple individuals of a given species are indicated using a trailing integer value:
anas_platyrhynchos1 anas_carolinensis1 dendrocygna_bicolor1
You should also not have problems if you use a naming scheme that suffixes the species binomial(s) with an accession number that is simply formatted (e.g. no slashes, dashes, etc.):
anas_platyrhynchos_KGH2267 anas_carolinensis_KGH2269 dendrocygna_bicolor_DWF4597
Running the assembly¶
Once your configuration file is created (best to use a decent text editor) that will not cause you grief, you are ready to start assembling your read data into contigs that we will search for UCEs. The code to do this for the three helper scripts is below (remember, we are using $HOME/anaconda/bin generically to refer to your anaconda or miniconda install).
The general process that the helper scripts use is:
- Create the output directory (AKA $ASSEMBLY, below)
- Create a
contigsfolder within the output directory
- For each taxon create
$ASSEMBLY/genus-speciesdirectory, based on config file entries
- Find the correct fastq files for a given sample
- Input those fastq files to whichever assembly program
- Assemble reads using user-defined/static
- Strip contigs of potentially problematic bases (ABySS-only)
- Normalize contig names
- Link assembly file with normalized names in $ASSEMBLY/genus-species/ into $ASSEMBLY/contigs/genus-species
# make a directory for log files mkdir log # run the assembly phyluce_assembly_assemblo_velvet \ --config config_file_you_created.conf \ --output /path/where/you/want/assemblies \ --kmer 35 \ --subfolder split-adapter-quality-trimmed \ --cores 12 \ --clean \ --log-path log
The directory structure created for velvet-based assemblies looks like:
path-to-output-directory/ contigs/ genus-species1 -> ../genus-species1/out_k31/contigs.fa genus-species1/ contigs.fasta -> out_k31/contigs.fa out_k31 velvetg-k31.err.log velvetg-k31.out.log velveth-k31.err.log velveth-k31.out.log
# make a directory for log files mkdir log # run the assembly phyluce_assembly_assemblo_abyss \ --config config_file_you_created.conf \ --output /path/where/you/want/assemblies \ --kmer 35 \ --subfolder split-adapter-quality-trimmed \ --cores 12 \ --clean \ --log-path log
Following assembly, phyluce_assembly_assemblo_abyss modifies the assemblies by replacing degenerate base codes with standard nucleotide encodings. We do this because lastz, which we use to match contigs to targeted UCE loci, is not compatible with degenerate IUPAC codes.
The phyluce_assembly_assemblo_abyss code makes these substitutions for every site having a degenerate code by selecting the appropriate nucleotide encoding randomly. The code also renames the ABySS assemblies using the velvet naming convention. The modified contigs are them symlinked into $ASSEMBLY/contigs. Unmodified contigs are available in $ASSEMBLY/genus- species/out_k*-contigs.fa
The directory structure created for ABySS-based assemblies looks like:
path-to-output-directory/ contigs/ genus-species1 -> ../genus-species1/out_k31-contigs-velvet.fa genus-species1/ abyss-k31.err.log contigs.fasta -> out_k31-contigs-velvet.fa out_k31-contigs.fa out_k31-scaffolds.fa out_k31-unitigs.fa abyss-k31.out.log coverage.hist out_k31-contigs-velvet.fa out_k31-stats
# make a directory for log files mkdir log # run the assembly phyluce_assembly_assemblo_trinity \ --config config_file_you_created.conf \ --output /path/where/you/want/assemblies \ --subfolder split-adapter-quality-trimmed \ --clean \ --cores 12 \ --log-path log
Question: Why do I not use the –kmer flag with phyluce_assembly_assemblo_trinity?
Trinity assembles using a “static” or “program-defined” (i.e., not user- defined) kmer value of 25.
Question: What is the –clean option?
The –clean option removes extraneous temporary files that Trinity creates during the assembly process. This results in a vastly smaller assemblies directory.
Question: Which assembly program do I pick?
Generally, I would suggest that you use Trinity. In my hands, it produces reasonable contig assemblies that are longer than the assemblies built by either velvet or ABySS. There are some caveats, however. If you want the most accurate assemblies possibly, then it may be best to use ABySS. This is because ABySS runs read-based error correction prior to assembly which results in more accurate contigs.
Question: For ABySS and velvet, what –kmer value do I use?
Also a hard question. Part of the reason that it is hard is due to the fact that we are trying to assemble data of heterogenous read depth (i.e., our reads are spread across (mostly) UCE loci, but the depth of coverage of each locus is varaible due to capture efficiency). Longer kmer values can give you longer (but fewer) contigs, while shorter kmer values produce fewer, more abundant contigs. In most cases, your assemblies will be decent with a kmer value around 55-65.