Command line interface


Copyright (C) 2018 Arthur Zwaenepoel

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Contact: arzwa@psb.vib-ugent.be


The command line interface (CLI) is the main way to interact with the wgd package. Each subcommand is a Click command that wraps a function of the same name followed by an underscore (e.g. the ksd command wraps ksd_()). This layout was chosen mostly so that the pipeline commands can reuse code from other subcommands.
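The wrapping pattern can be sketched as follows. This is only an illustration of the idea, not the wgd source: it uses argparse from the standard library in place of Click to stay dependency-free, and the function bodies, the wf1 helper, the default directory names and the output file name are hypothetical placeholders.

```python
import argparse


def ksd_(gene_families, sequences, output_directory="wgd_ksd", n_threads=4):
    """Plain function holding the logic, callable from other commands."""
    # Hypothetical placeholder body: a real implementation would run the
    # analysis and write its results here.
    return f"{output_directory}/{gene_families}.ks.tsv"


def ksd(argv=None):
    """CLI wrapper: parses the arguments, then delegates to ksd_()."""
    parser = argparse.ArgumentParser(prog="ksd")
    parser.add_argument("gene_families")
    parser.add_argument("sequences")
    parser.add_argument("-o", "--output_directory", default="wgd_ksd")
    parser.add_argument("-n", "--n_threads", type=int, default=4)
    args = parser.parse_args(argv)
    return ksd_(args.gene_families, args.sequences,
                args.output_directory, args.n_threads)


def wf1(sequences, output_directory="wgd_wf1"):
    """A pipeline command reuses ksd_() directly, skipping argument parsing."""
    families = "paranome.mcl"  # hypothetical: would come from the MCL step
    return ksd_(families, sequences, output_directory)
```

Keeping the logic in the underscore function means a workflow command can chain several steps as ordinary function calls.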

Upon successful installation you should be able to run:

$ wgd -h

Usage: wgd [OPTIONS] COMMAND [ARGS]...

  Welcome to the wgd command line interface!

                         _______
                         \  ___ `'.
         _     _ .--./)   ' |--.\  \
   /\    \   ///.''\    | |    \  '
   `\  //\\ //| |  | |   | |     |  '
     \`//  \'/  \`-' /    | |     |  |
      \|   |/   /("'`     | |     ' .'
       '        \ '---.   | |___.' /'
                 /'""'.\ /_______.'/
                ||     ||\_______|/
                \'. __//
                 `'---'

  wgd  Copyright (C) 2018 Arthur Zwaenepoel
  This program comes with ABSOLUTELY NO WARRANTY;
  This is free software, and you are welcome to redistribute it
  under certain conditions;

  Contact: arzwa@psb.vib-ugent.be

Options:
  -v, --verbosity [info|debug]  Verbosity level, default = info.
  -l, --logfile TEXT            File to write logs to (optional)
  -h, --help                    Show this message and exit.

Commands:
  kde  Fit a KDE to a Ks distribution.
  ksd  Ks distribution construction.
  mcl  All-vs.-all blastp + MCL analysis.
  mix  Mixture modeling of Ks distributions.
  syn  Co-linearity analyses.
  viz  Plot histograms/densities (interactively).
  wf1  Standard workflow whole paranome Ks.
  wf2  Standard workflow one-vs-one ortholog Ks.

Note that the verbosity option is set before the subcommand:

$ wgd --verbosity [info|debug] [COMMAND]

This sets the verbosity of the logging.

All commands come with usage instructions and documentation of their options and arguments, which can be viewed with the --help or -h flag. These should be largely self-explanatory; for further detail, refer to the documentation of the specific functions that are called, which can be found on this page (e.g. the function called by wgd ksd is wgd_cli.ksd_()).

For more information on the methods used in wgd to compute KS distributions, please refer to Some informal notes on the methods implemented in wgd.

Example

Here is a small example of how to use the package through the CLI. This workflow constructs a Ks distribution from a FASTA file of CDS sequences called ath.cds.fasta. Your file names may differ, but the point will be clear.

(1) Get the paranome, i.e. perform all-against-all Blastp and MCL clustering. Note that we specify 8 threads with -n:

$ wgd mcl --cds --mcl -s ath.cds.fasta -o ./ -n 8

(2) Construct a Ks distribution, using FastTree (the default weighting method) to infer the phylogenetic trees used in the node weighting procedure:

$ wgd ksd -o ./ -n 8 ./ath.mcl ath.cds.fasta

(3) Run I-ADHoRe to get an anchor-point Ks distribution as well as dotplots. Here we need a structural annotation in GFF format (see e.g. https://bioinformatics.psb.ugent.be/plaza/versions/plaza_v4_dicots/download/ for examples of such files):

$ wgd syn ath.gff ath.mcl -ks ath.mcl.ks.tsv -f gene -a ID

(4) Fit some Gaussian mixture models with 1 to 5 components (default parameters):

$ wgd mix ath.mcl.ks.tsv -n 1 5

For more information on mixture modeling and some cautionary notes, refer to A note on mixture models for KS distributions.

(5) Explore the full and anchor distribution with kernel density estimates interactively. First run a bokeh server in the background:

$ bokeh serve &

Next, execute the following command:

$ wgd viz -i -ks ath.mcl.ks.tsv,ath.mcl.ks_anchors.tsv -l full,anchors

A tab should open in your default browser. See Visualization module for more information on visualization with wgd viz.
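Steps (1) to (4) can also be chained from a small script. The sketch below simply replays the commands exactly as written above using the standard subprocess module; the run_steps helper and its dry_run switch are illustrative names, not part of wgd. Step (5) is left out because it is interactive and needs a running bokeh server.

```python
import shlex
import subprocess

# Commands copied from steps (1) to (4) above; file names follow the example.
STEPS = [
    "wgd mcl --cds --mcl -s ath.cds.fasta -o ./ -n 8",
    "wgd ksd -o ./ -n 8 ./ath.mcl ath.cds.fasta",
    "wgd syn ath.gff ath.mcl -ks ath.mcl.ks.tsv -f gene -a ID",
    "wgd mix ath.mcl.ks.tsv -n 1 5",
]


def run_steps(steps=STEPS, dry_run=True):
    """Run each step in order, stopping at the first failure.

    With dry_run=True nothing is executed; the parsed argument lists are
    returned so that they can be inspected first.
    """
    parsed = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in parsed:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(argv, check=True)
    return parsed
```

Calling run_steps(dry_run=False) executes the commands one after another, which requires wgd and its external dependencies to be installed.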


Reference

wgd_cli.blast_mcl(cds=True, mcl=True, one_v_one=False, sequences=None, species_ids=None, blast_results=None, inflation_factor=2.0, eval_cutoff=1e-10, output_dir='wgd_blast', n_threads=4)

All vs. all Blast + MCL pipeline. For usage in the wgd CLI. Can be used to perform all vs. all Blast, MCL clustering and one vs. one ortholog delineation.

Parameters:
  • cds – boolean, indicates that the provided sequences are CDS
  • mcl – boolean, perform MCL clustering
  • one_v_one – boolean, whether to infer one vs. one orthologs (reciprocal best hits) (True) or a paranome (False).
  • sequences – CDS fasta files, if multiple (for one vs. one ortholog identification), then as a comma-separated string e.g. ath.fasta,aly.fasta
  • species_ids – comma-separated species ids, optional for one-vs-one ortholog delineation (will prefix the sequence IDs in that case).
  • blast_results – precomputed blast results (tab separated blast output style)
  • inflation_factor – inflation factor for MCL clustering
  • eval_cutoff – e-value cut off for blastp analysis
  • output_dir – output directory
  • n_threads – number of threads to use
Returns:

output file name
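To make the reciprocal best hit idea behind one_v_one concrete, here is a minimal sketch. It is not the wgd implementation. It assumes the standard 12-column tabular BLAST output (query ID in column 1, subject ID in column 2, e-value in column 11, bit score in column 12), ranks hits by bit score, and assumes that sequence IDs carry a species prefix separated by '|'.

```python
def parse_blast_line(line):
    """Return (query, subject, evalue, bitscore) from a tabular BLAST line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1], float(fields[10]), float(fields[11])


def reciprocal_best_hits(lines, eval_cutoff=1e-10):
    """Return sorted (id_a, id_b) pairs that are each other's best hit."""
    best = {}  # query -> (bitscore, subject)
    for line in lines:
        query, subject, evalue, bitscore = parse_blast_line(line)
        # Skip weak hits and hits within the same species (same ID prefix).
        if evalue > eval_cutoff:
            continue
        if query.split("|")[0] == subject.split("|")[0]:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    pairs = set()
    for query, (_, subject) in best.items():
        # Reciprocal: the subject's own best hit must point back to the query.
        if subject in best and best[subject][1] == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)
```

A gene whose best hit prefers a different partner is dropped, which is what makes the resulting pairs one vs. one.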

wgd_cli.kde_(ks_distribution, filters, ks_range, bandwidth, bins, output_file)

Fit a KDE to a Ks distribution.

Parameters:
  • ks_distribution – Ks distribution file
  • filters – alignment filters
  • ks_range – Ks range
  • bandwidth – bandwidth
  • bins – number of histogram bins
  • output_file – output file
Returns:

nada
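For readers unfamiliar with kernel density estimation, the roles of ks_range and bandwidth can be sketched in plain Python. This is a textbook Gaussian KDE and not the code used by wgd; the inclusive range filter and the default range of (0.05, 5) are assumptions made for the example.

```python
import math


def filter_ks(values, ks_range=(0.05, 5)):
    """Keep only Ks values inside ks_range (bounds assumed inclusive)."""
    low, high = ks_range
    return [v for v in values if low <= v <= high]


def gaussian_kde(values, bandwidth):
    """Return a function x -> estimated density at x.

    Each observation contributes a Gaussian bump of width `bandwidth`;
    the bumps are averaged so that the density integrates to one.
    """
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density
```

A smaller bandwidth gives a spikier curve that follows individual observations, while a larger one smooths over peaks.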

wgd_cli.ksd_(gene_families, sequences, output_directory, protein_sequences=None, tmp_dir=None, aligner='muscle', codeml='codeml', times=1, min_msa_length=100, ignore_prefixes=False, one_v_one=False, pairwise=False, preserve=False, n_threads=4, weighting_method='fasttree', max_pairwise=10000, **kwargs)

Ks distribution construction pipeline. For usage in the wgd CLI.

Parameters:
  • gene_families – gene families, i.e. tab separated paralogs or one-vs-one orthologs (see blast_mcl())
  • sequences – CDS fasta files, if multiple (for constructing one-vs.-one ortholog distribution) then as a comma separated string
  • output_directory – output directory
  • protein_sequences – protein sequences (optional), by default CDS files are translated using the standard genetic code.
  • tmp_dir – tmp directory name (optional)
  • aligner – aligner to use
  • codeml – path to codeml executable
  • times – number of times to iteratively perform ML estimation of Ks, Ka and omega values.
  • min_msa_length – minimum multiple sequence alignment length
  • ignore_prefixes – ignore prefixes defined by ‘|’ in gene IDs
  • one_v_one – boolean, one-vs.-one ortholog analysis
  • pairwise – run in pairwise mode
  • preserve – boolean, preserve codeml output files, multiple sequence alignments and trees?
  • async – use the async library for parallelization (not recommended)
  • n_threads – number of threads to use
  • weighting_method – weighting method (fasttree, phyml or alc)
  • max_pairwise – maximum number of pairwise combinations a gene family may have. This effectively filters out families of size n where n*(n-1)/2 exceeds max_pairwise.
Returns:

output file name
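The effect of max_pairwise can be illustrated with a few lines of Python. The helper names are hypothetical and the sketch only restates the n*(n-1)/2 rule given above, with each family represented as a list of gene IDs (one tab separated line of the gene families file).

```python
def n_pairs(family_size):
    """Number of pairwise combinations in a family of the given size."""
    return family_size * (family_size - 1) // 2


def filter_families(families, max_pairwise=10000):
    """Drop families whose number of gene pairs exceeds max_pairwise."""
    return [fam for fam in families if n_pairs(len(fam)) <= max_pairwise]


def read_families(lines):
    """Split tab separated lines into lists of gene IDs, skipping blanks."""
    return [line.strip().split("\t") for line in lines if line.strip()]
```

With the default of 10000, a family of 141 genes (9870 pairs) is kept, whereas a family of 142 genes (10011 pairs) is filtered out.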

wgd_cli.mix_(ks_distribution, filters, ks_range, method, components, bins, output_dir, gamma, n_init, max_iter)

Mixture modeling tools.

Note that histogram weighting is done after applying specified filters. Also note that mixture models are fitted to node-averaged (not weighted) histograms. Please interpret mixture model results with caution, for more info, refer to A note on mixture models for KS distributions.

Parameters:
  • ks_distribution – Ks distribution data frame
  • filters – alignment stats filters
  • ks_range – Ks range used for models
  • method – mixture modeling method, Bayesian/ordinary Gaussian mixtures
  • components – number of components to use (tuple: (min, max))
  • bins – number of histogram bins for visualization
  • output_dir – output directory
  • gamma – gamma parameter for BGMM
  • n_init – number of k-means initializations (best is kept)
  • max_iter – number of iterations
Returns:

nada
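As a rough illustration of what fitting mixtures over a range of component numbers involves, the following sketch fits one-dimensional Gaussian mixtures by expectation-maximization and compares them by BIC. It is not the wgd implementation: the quantile-based initialization (instead of k-means), the use of BIC as the selection criterion and all function names are assumptions made for this example.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(data, n_components, max_iter=100, tol=1e-8):
    """Fit a 1-D Gaussian mixture by EM.

    Returns (weights, means, sds, loglik), where loglik is the
    log-likelihood evaluated in the last E-step.
    """
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: means at evenly spaced quantiles, shared spread.
    means = [xs[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    spread = (xs[-1] - xs[0]) / (2.0 * n_components) or 1.0
    sds = [spread] * n_components
    weights = [1.0 / n_components] * n_components
    loglik = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        new_loglik = 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            new_loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(n_components):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
    return weights, means, sds, loglik


def bic(loglik, n_points, n_components):
    """Bayesian information criterion; lower is better."""
    n_params = 3 * n_components - 1  # means, sds and free weights
    return -2.0 * loglik + n_params * math.log(n_points)


def select_components(data, components=(1, 5)):
    """Return the number of components in (min, max) with the lowest BIC."""
    low, high = components
    scores = {k: bic(fit_gmm_1d(data, k)[3], len(data), k)
              for k in range(low, high + 1)}
    return min(scores, key=scores.get)
```

Information criteria tend to favor extra components on real Ks data, so a lowest-BIC model should be read together with the cautionary notes referred to above rather than as evidence on its own.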

wgd_cli.syn_(gff_file, families, output_dir, ks_distribution, feature='mRNA', gene_attribute='Parent', min_length=250, ks_range=(0.05, 5), **kwargs)

Co-linearity analysis with I-ADHoRe 3.0. For usage in the wgd CLI.

Parameters:
  • gff_file – GFF3 annotation file (see the annotation files on PLAZA as an example)
  • families – gene families as tab separated gene IDs, see blast_mcl()
  • output_dir – output directory
  • ks_distribution – Ks distribution tsv file, see ksd_()
  • feature – keyword for entities of interest in the GFF file, e.g. ‘CDS’ or ‘mRNA’
  • gene_attribute – attribute key for the gene ID in the GFF (9th column), e.g. ‘ID’ or ‘Parent’
Returns:

nothing at all
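How feature and gene_attribute select gene IDs from a GFF3 file can be sketched as follows. This is an illustration under the GFF3 column layout (feature type in column 3, start and end in columns 4 and 5, strand in column 7, key=value attributes in column 9), not the parser used by wgd; the function names are hypothetical.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gene_positions(gff_lines, feature="mRNA", gene_attribute="Parent"):
    """Return (gene_id, sequence, start, end, strand) for matching rows."""
    genes = []
    for line in gff_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature:
            continue
        attrs = parse_attributes(cols[8])
        if gene_attribute in attrs:
            genes.append((attrs[gene_attribute], cols[0],
                          int(cols[3]), int(cols[4]), cols[6]))
    return genes
```

The IDs extracted this way must match the gene IDs in the gene families file, which is why the example above passes -f gene -a ID for its annotation.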

wgd_cli.viz_(ks_distributions, alpha_values, colors, labels, hist_type, title, output_file, filters, ks_range, bins, interactive=False, weighted=False)

Plot (stacked) histograms (interactively). Add option to plot node-weighted histograms in the same fashion.

Parameters:
  • ks_distributions – a directory with ks distributions (other files are ignored) or a comma-separated string of file names
  • alpha_values – alpha values for the different distributions (in the same order). Only relevant for non-interactive visualization.
  • colors – as in alpha_values but for colors
  • labels – as in alpha_values but for legend labels (by default the file names are used). This is also relevant for the interactive bokeh visualization (as opposed to alpha_values and colors).
  • hist_type – histogram type (matplotlib), either ‘barstacked’, ‘step’ or ‘stepfilled’.
  • title – plot title
  • output_file – output file name
  • interactive – render an interactive bokeh plot. This makes some of the above arguments redundant
Returns:

nada