Workflows

As of phyluce 1.7.0, there is new functionality that uses “workflows” to perform different actions. Key among these are things like computing coverage across UCE loci and phasing SNPs within UCE loci. These workflows use Snakemake, internally, and they are pretty easily portable and/or easy to modify, if desired.

What’s Different

Previously, phyluce used its own, internal pipeline code to run multi-step, bioinformatic workflows. These have now been moved into “workflows”, which accomplish the same general steps but are much easier to maintain and run using Snakemake.

Workflow Location

The workflow Snakemake files should be packaged into your conda installation, in case you are interested in modifying them for any reason. To most easily find their location, activate the phyluce environment, then run:

# get location of python in our conda environment
which python

# this returns something like:
/Users/bcf/miniconda3/envs/phyluce/bin/python

This means that the workflow Snakemake files will be located at /Users/bcf/miniconda3/envs/phyluce/phyluce/workflows. Individual workflows can be run directly by Snakemake from this directory, or they can be copied elsewhere, modified, and run by Snakemake. You can also run these workflows within phyluce (see below).

Workflow Configuration

The workflow configuration files are detailed below, but it’s important to note that they use a different configuration format than other phyluce configuration files. Instead of Windows INI based format, the new workflows (and Snakemake, in general) use YAML syntax. See examples below.

Different Workflows

Mapping

Right now, the “mapping” workflow precedes all other workflows and is responsible for mapping reads to contigs, marking duplicates, computing coverage, and outputting BAM files representing the mapped reads. In order to run this new workflow, create a YAML-formatted configuration file that contains the names and paths (relative or absolute) to your contigs and your trimmed reads:

reads:
    alligator-mississippiensis: ../../phyluce/tests/test-expected/raw-reads/alligator-mississippiensis/
    gallus-gallus: ../../phyluce/tests/test-expected/raw-reads/gallus-gallus
    peromyscus-maniculatus: ../../phyluce/tests/test-expected/raw-reads/peromyscus-maniculatus
    rana-sphenocephafa: ../../phyluce/tests/test-expected/raw-reads/rana-sphenocephafa

contigs:
    alligator-mississippiensis: ../../phyluce/tests/test-expected/spades/contigs/alligator_mississippiensis.contigs.fasta
    gallus-gallus: ../../phyluce/tests/test-expected/spades/contigs/gallus_gallus.contigs.fasta
    peromyscus-maniculatus: ../../phyluce/tests/test-expected/spades/contigs/peromyscus_maniculatus.contigs.fasta
    rana-sphenocephafa: ../../phyluce/tests/test-expected/spades/contigs/rana_sphenocephafa.contigs.fasta

The first section of the file gives the name and path to a folder of raw-reads for each sample (this folder is what results from illumiprocessor). The second section gives the name and path to the contigs assembled for each organism.

To map these reads to the assembled contigs, run:

phyluce_workflow --config <path to your config file> \
    --output <path to some output folder name> \
    --workflow mapping \
    --cores 1

This will run the workflow, and your results will end up in the output folder specified. The structure of the output folder will look something like the following:

.
├── coverage
│   ├── all-taxon.summary.csv
│   ├── alligator-mississippiensis.samtools.cov.tdt
│   ├── alligator-mississippiensis.summary.csv
│   ├── gallus-gallus.samtools.cov.tdt
│   ├── gallus-gallus.summary.csv
│   ├── peromyscus-maniculatus.samtools.cov.tdt
│   ├── peromyscus-maniculatus.summary.csv
│   ├── rana-sphenocephafa.samtools.cov.tdt
│   └── rana-sphenocephafa.summary.csv
├── mapped_reads
│   ├── alligator-mississippiensis.fxm.sorted.md.bam
│   ├── alligator-mississippiensis.fxm.sorted.md.bam.flagstats.txt
│   ├── gallus-gallus.fxm.sorted.md.bam
│   ├── gallus-gallus.fxm.sorted.md.bam.flagstats.txt
│   ├── peromyscus-maniculatus.fxm.sorted.md.bam
│   ├── peromyscus-maniculatus.fxm.sorted.md.bam.flagstats.txt
│   ├── rana-sphenocephafa.fxm.sorted.md.bam
│   └── rana-sphenocephafa.fxm.sorted.md.bam.flagstats.txt
└── references
    ├── alligator-mississippiensis.contigs.fasta
    ├── alligator-mississippiensis.contigs.fasta.amb
    ├── alligator-mississippiensis.contigs.fasta.ann
    ├── alligator-mississippiensis.contigs.fasta.bwt
    ├── alligator-mississippiensis.contigs.fasta.pac
    ├── alligator-mississippiensis.contigs.fasta.sa
    ├── gallus-gallus.contigs.fasta
    ├── gallus-gallus.contigs.fasta.amb
    ├── gallus-gallus.contigs.fasta.ann
    ├── gallus-gallus.contigs.fasta.bwt
    ├── gallus-gallus.contigs.fasta.pac
    ├── gallus-gallus.contigs.fasta.sa
    ├── peromyscus-maniculatus.contigs.fasta
    ├── peromyscus-maniculatus.contigs.fasta.amb
    ├── peromyscus-maniculatus.contigs.fasta.ann
    ├── peromyscus-maniculatus.contigs.fasta.bwt
    ├── peromyscus-maniculatus.contigs.fasta.pac
    ├── peromyscus-maniculatus.contigs.fasta.sa
    ├── rana-sphenocephafa.contigs.fasta
    ├── rana-sphenocephafa.contigs.fasta.amb
    ├── rana-sphenocephafa.contigs.fasta.ann
    ├── rana-sphenocephafa.contigs.fasta.bwt
    ├── rana-sphenocephafa.contigs.fasta.pac
    └── rana-sphenocephafa.contigs.fasta.sa

Within the coverage directory are outputs on a per-sample and overall basis. For example, alligator-mississippiensis.summary.csv will contain summary info on coverage for the alligator-mississippiensis contigs - one line for each contig. Overall summary statistics (by taxon) will be in all-taxon.summary.csv. BAM files resulting from the mapping are in the mapped-reads directory, along with the output of samtools flagstats for each BAM. The references directory contains the FASTA-formatted contigs you started with and their bwa indexes.

Attention

If you want to compute coverage on UCE contigs (only) versus all contigs that were assembled, run the probe/bait to contig matching, create a monolithic FASTA for whatever samples you want, explode that FASTA --by-taxon, then use the path to those files for each taxon in the contig section of the workflow config file, described above.

You can also perform a dry-run of the software by adding the --dry-run parameter, like so:

phyluce_workflow --config <path to your config file> \
    --output <path to some output folder name> \
    --workflow mapping \
    --cores 1 \
    --dry-run

This will show you what should happen, without performing the analysis. Log files from the Snakemake run will be present in a hidden directory in your output folder named .snakemake. Like so:

.
├── .snakemake
│   ├── auxiliary
│   ├── conda
│   ├── conda-archive
│   ├── incomplete
│   ├── locks
│   ├── log
│      └── 2021-03-01T150829.811458.snakemake.log
│   ├── metadata
│   ├── scripts
│   ├── shadow
│   └── singularity
├── coverage
├── mapped_reads
└── references

Phasing

The phasing workflow is a re-implementation of the approach that we used in Andermann et al. 2018 that uses mapping information (generated above), along with samtools and pilon_ to output the phased contigs. The goal of reimplmentation was to make this pipeline more robust. You run the pipeline by (1) running the mapping workflow, above. Then, (2) you create a second configuration file that looks like the following:

bams:
    alligator-mississippiensis: ../tests/test-data/bams/alligator-mississippiensis.fxm.sorted.md.bam
    gallus-gallus: ../tests/test-data/bams/gallus-gallus.fxm.sorted.md.bam
    peromyscus-maniculatus: ../tests/test-data/bams/peromyscus-maniculatus.fxm.sorted.md.bam
    rana-sphenocephafa: ../tests/test-data/bams/rana-sphenocephafa.fxm.sorted.md.bam

contigs:
    alligator-mississippiensis: ../tests/test-data/contigs/alligator_mississippiensis.contigs.fasta
    gallus-gallus: ../tests/test-data/contigs/gallus_gallus.contigs.fasta
    peromyscus-maniculatus: ../tests/test-data/contigs/peromyscus_maniculatus.contigs.fasta
    rana-sphenocephafa: ../tests/test-data/contigs/rana_sphenocephafa.contigs.fasta

This contains a section pointing to the location of the BAM files created during mapping, and you can copy over the contigs section of the mapping config file. Finally, (3) you run the workflow with:

phyluce_workflow --config <path to your config file> \
    --output <path to some output folder name> \
    --workflow mapping \
    --cores 1

This produces a folder of output containing BAMs and FASTAs for each haplotye that looks like the following (here, only showing the results for gallus-gallus versus all 4 taxa in the configuration file:

.
├── bams
│   ├── gallus-gallus.0.bam
│   ├── gallus-gallus.0.bam.bai
│   ├── gallus-gallus.1.bam
│   ├── gallus-gallus.1.bam.bai
│   └── gallus-gallus.chimera.bam
└── fastas
    ├── gallus-gallus.0.changes
    ├── gallus-gallus.0.fasta
    ├── gallus-gallus.0.vcf
    ├── gallus-gallus.1.changes
    ├── gallus-gallus.1.fasta
    └── gallus-gallus.1.vcf

Right now, what you do with these files is left up to you (e.g. in terms of merging their contents and getting the data aligned). You can essentially group all the *.0.fasta and *.1.fasta files for all taxa together as new “assemblies” of data and start the phyluce analysis process over from phyluce_assembly_match_contigs_to_probes.

Correction

This is a new workflow that we’ve put together that helps account for sequencing depth and base-calling quality in assembled contigs. Essentially, you can think of this “correction” process as a filter that helps remove low-depth, low-quality base calls from your assembly data generated by phyluce. We are using this, in particular, with UCE data collected from toepads.

To run the workflow, (1) first run the mapping workflow above and (2) create a configuration file that looks like:

bams:
    alligator-mississippiensis: ../tests/test-data/bams/alligator-mississippiensis.fxm.sorted.md.bam
    gallus-gallus: ../tests/test-data/bams/gallus-gallus.fxm.sorted.md.bam
    peromyscus-maniculatus: ../tests/test-data/bams/peromyscus-maniculatus.fxm.sorted.md.bam
    rana-sphenocephafa: ../tests/test-data/bams/rana-sphenocephafa.fxm.sorted.md.bam

contigs:
    alligator-mississippiensis: ../tests/test-data/contigs/alligator_mississippiensis.contigs.fasta
    gallus-gallus: ../tests/test-data/contigs/gallus_gallus.contigs.fasta
    peromyscus-maniculatus: ../tests/test-data/contigs/peromyscus_maniculatus.contigs.fasta
    rana-sphenocephafa: ../tests/test-data/contigs/rana_sphenocephafa.contigs.fasta

This contains a section pointing to the location of the BAM files created during mapping, and you can copy over the contigs section of the mapping config file. Finally, (3) you run the workflow with:

phyluce_workflow --config <path to your config file> \
    --output <path to some output folder name> \
    --workflow correction \
    --cores 1

This produces a folder of output that looks like the following. Within this directory as a set of “consensus” contigs, where variant bases have been hard-masked that have QUAL<20 | DP<5 | AN>2:

.
├── consensus
│   ├── alligator-mississippiensis.consensus.filt.fasta
│   ├── gallus-gallus.consensus.filt.fasta
│   ├── peromyscus-maniculatus.consensus.filt.fasta
│   └── rana-sphenocephafa.consensus.filt.fasta
└── filtered_norm_pileups
    ├── alligator-mississippiensis.norm.flt-indels.Q20.DP10.bcf
    ├── alligator-mississippiensis.norm.flt-indels.Q20.DP10.bcf.csi
    ├── gallus-gallus.norm.flt-indels.Q20.DP10.bcf
    ├── gallus-gallus.norm.flt-indels.Q20.DP10.bcf.csi
    ├── peromyscus-maniculatus.norm.flt-indels.Q20.DP10.bcf
    ├── peromyscus-maniculatus.norm.flt-indels.Q20.DP10.bcf.csi
    ├── rana-sphenocephafa.norm.flt-indels.Q20.DP10.bcf
    └── rana-sphenocephafa.norm.flt-indels.Q20.DP10.bcf.csi

Once the “correction” process has been run, you can re-input the corrected contigs to the phyluce analysis process from the phyluce_assembly_match_contigs_to_probes program.