Assembly

Setup

Once your reads are clean, you’re ready to assemble. At the moment, you can use velvet, ABySS, and Trinity for assembly. However, be aware that there is not a conda install package for Trinity due to some difficulties in how that package is structured.

Most of the assembly process is automated using code within phyluce, specifically the following 3 scripts:

  • phyluce_assembly_assemblo_abyss
  • phyluce_assembly_assemblo_trinity
  • phyluce_assembly_assemblo_velvet

The code of each of the above programs always expects your input directories to have the following structure (from the Quality Control section):

uce-clean/
    genus_species1/
        adapters.fasta
        raw-reads/
            genus_species1-READ1.fastq.gz (symlink)
            genus_species1-READ2.fastq.gz (symlink)
        split-adapter-quality-trimmed/
            genus_species1-READ1.fastq.gz
            genus_species1-READ2.fastq.gz
            genus_species1-READ-singleton.fastq.gz
        stats/
            genus_species1-adapter-contam.txt

And, each of these assembly helper programs take the same configuration files as input. You should format the configuration file for input according to the following scheme:

[samples]
name_you_want_assembly_to_have:/path/to/uce-clean/genus_species1

In practice, this means you need to create a configuration file that looks like:

[samples]
anas_platyrhynchos1:/path/to/uce-clean/anas_platyrhynchos1
anas_carolinensis1:/path/to/uce-clean/anas_carolinensis1
dendrocygna_bicolor1:/path/to/uce-clean/dendrocygna_bicolor1

The assembly name on the left side of the colon can be whatever you want. The path name on the right hand side of the colon must be a valid path to a directory containing read data in a format similar to that described above.

Attention

Assembly names MUST be unique.

Question: How do I name my samples/assemblies?

Naming samples is a contentious issue and is also a hard thing to deal with using computer code. You should never have a problem if you name your samples as follows, where the genus and specific epithet are separated by an underscore, and multiple individuals of a given species are indicated using a trailing integer value:

anas_platyrhynchos1
anas_carolinensis1
dendrocygna_bicolor1

You should also not have problems if you use a naming scheme that suffixes the species binomial(s) with an accession number that is simply formatted (e.g. no slashes, dashes, etc.):

anas_platyrhynchos_KGH2267
anas_carolinensis_KGH2269
dendrocygna_bicolor_DWF4597

Running the assembly

Once your configuration file is created (best to use a decent text editor) that will not cause you grief, you are ready to start assembling your read data into contigs that we will search for UCEs. The code to do this for the three helper scripts is below (remember, we are using $HOME/anaconda/bin generically to refer to your anaconda or miniconda install).

General process

The general process that the helper scripts use is:

  1. Create the output directory (AKA $ASSEMBLY, below)
  2. Create a contigs folder within the output directory
  3. For each taxon create $ASSEMBLY/genus-species directory, based on config file entries
  4. Find the correct fastq files for a given sample
  5. Input those fastq files to whichever assembly program
  6. Assemble reads using user-defined/static --kmer value
  7. Strip contigs of potentially problematic bases (ABySS-only)
  8. Normalize contig names
  9. Link assembly file with normalized names in $ASSEMBLY/genus-species/ into $ASSEMBLY/contigs/genus-species

velvet

# make a directory for log files
mkdir log
# run the assembly
phyluce_assembly_assemblo_velvet \
    --config config_file_you_created.conf \
    --output /path/where/you/want/assemblies \
    --kmer 35 \
    --subfolder split-adapter-quality-trimmed \
    --cores 12 \
    --clean \
    --log-path log

Results

The directory structure created for velvet-based assemblies looks like:

path-to-output-directory/
    contigs/
        genus-species1 -> ../genus-species1/out_k31/contigs.fa
    genus-species1/
        contigs.fasta -> out_k31/contigs.fa
        out_k31
        velvetg-k31.err.log
        velvetg-k31.out.log
        velveth-k31.err.log
        velveth-k31.out.log

ABySS

# make a directory for log files
mkdir log
# run the assembly
phyluce_assembly_assemblo_abyss \
    --config config_file_you_created.conf \
    --output /path/where/you/want/assemblies \
    --kmer 35 \
    --subfolder split-adapter-quality-trimmed \
    --cores 12 \
    --clean \
    --log-path log

Attention

Following assembly, phyluce_assembly_assemblo_abyss modifies the assemblies by replacing degenerate base codes with standard nucleotide encodings. We do this because lastz, which we use to match contigs to targeted UCE loci, is not compatible with degenerate IUPAC codes.

The phyluce_assembly_assemblo_abyss code makes these substitutions for every site having a degenerate code by selecting the appropriate nucleotide encoding randomly. The code also renames the ABySS assemblies using the velvet naming convention. The modified contigs are them symlinked into $ASSEMBLY/contigs. Unmodified contigs are available in $ASSEMBLY/genus- species/out_k*-contigs.fa

Results

The directory structure created for ABySS-based assemblies looks like:

path-to-output-directory/
    contigs/
        genus-species1 -> ../genus-species1/out_k31-contigs-velvet.fa
    genus-species1/
        abyss-k31.err.log
        contigs.fasta -> out_k31-contigs-velvet.fa
        out_k31-contigs.fa
        out_k31-scaffolds.fa
        out_k31-unitigs.fa
        abyss-k31.out.log
        coverage.hist
        out_k31-contigs-velvet.fa
        out_k31-stats

Trinity

# make a directory for log files
mkdir log
# run the assembly
phyluce_assembly_assemblo_trinity \
    --config config_file_you_created.conf \
    --output /path/where/you/want/assemblies \
    --subfolder split-adapter-quality-trimmed \
    --clean \
    --cores 12 \
    --log-path log

Question: Why do I not use the –kmer flag with phyluce_assembly_assemblo_trinity?

Trinity assembles using a “static” or “program-defined” (i.e., not user- defined) kmer value of 25.

Question: What is the –clean option?

The –clean option removes extraneous temporary files that Trinity creates during the assembly process. This results in a vastly smaller assemblies directory.

Results

The directory structure created for Trinity-based assemblies looks like:

path-to-output-directory/
    contigs/
        genus-species1 -> ../genus-species1/Trinity.fasta
    genus-species1/
        contigs.fasta -> Trinity.fasta
        Trinity.fasta
        trinity.log

Common questions

Question: Which assembly program do I pick?

Generally, I would suggest that you use Trinity. In my hands, it produces reasonable contig assemblies that are longer than the assemblies built by either velvet or ABySS. There are some caveats, however. If you want the most accurate assemblies possibly, then it may be best to use ABySS. This is because ABySS runs read-based error correction prior to assembly which results in more accurate contigs.

Question: For ABySS and velvet, what –kmer value do I use?

Also a hard question. Part of the reason that it is hard is due to the fact that we are trying to assemble data of heterogenous read depth (i.e., our reads are spread across (mostly) UCE loci, but the depth of coverage of each locus is varaible due to capture efficiency). Longer kmer values can give you longer (but fewer) contigs, while shorter kmer values produce fewer, more abundant contigs. In most cases, your assemblies will be decent with a kmer value around 55-65.