Command line interface

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Contact: arzwa@psb.vib-ugent.be

The command line interface (CLI) is the main way to interact with the wgd package. Each CLI command is a Click command that wraps a function of the same name followed by an underscore (e.g. the ksd command wraps ksd_()). This split was chosen mainly so that the pipeline commands can reuse code from other subcommands.

Upon successful installation you should be able to run:

$ wgd -h

Usage: wgd [OPTIONS] COMMAND [ARGS]...

  Welcome to the wgd command line interface!

                         _______
                         \  ___ `'.
         _     _ .--./)   ' |--.\  \
   /\    \   ///.''\    | |    \  '
   `\  //\\ //| |  | |   | |     |  '
     \`//  \'/  \`-' /    | |     |  |
      \|   |/   /("'`     | |     ' .'
       '        \ '---.   | |___.' /'
                 /'""'.\ /_______.'/
                ||     ||\_______|/
                \'. __//
                 `'---'

  wgd  Copyright (C) 2018 Arthur Zwaenepoel
  This program comes with ABSOLUTELY NO WARRANTY;
  This is free software, and you are welcome to redistribute it
  under certain conditions;

  Contact: arzwa@psb.vib-ugent.be

Options:
  -v, --verbosity [info|debug]  Verbosity level, default = info.
  -l, --logfile TEXT            File to write logs to (optional)
  -h, --help                    Show this message and exit.

Commands:
  kde  Fit a KDE to a Ks distribution.
  ksd  Ks distribution construction.
  mcl  All-vs.-all blastp + MCL analysis.
  mix  Mixture modeling of Ks distributions.
  syn  Co-linearity analyses.
  viz  Plot histograms/densities (interactively).
  wf1  Standard workflow whole paranome Ks.
  wf2  Standard workflow one-vs-one ortholog Ks.

Note that the verbosity option is set before the subcommand:

$ wgd --verbosity [info|debug] [COMMAND]

This sets the verbosity level of the logging.

All commands come with usage instructions and documentation of their options and arguments, which can be viewed with the --help or -h flag. These should be largely self-explanatory; for further detail, refer to the documentation of the functions the commands call, which is listed in the Reference section of this page (e.g. the function called by wgd ksd is wgd_cli.ksd_()).

For more information on the methods used in wgd to compute K_S distributions, please refer to Some informal notes on the methods implemented in wgd.

Example

Here is a small example of how to use the package through the CLI. This workflow constructs a K_S distribution from a fasta file of CDS sequences called ath.cds.fasta. Your file names may differ, but the steps are the same.

(1) Get the paranome, i.e. perform all-against-all Blastp and MCL clustering. Note that we specify 8 threads:

$ wgd mcl --cds --mcl -s ath.cds.fasta -o ./ -n 8

(2) Construct a K_S distribution, use FastTree for inferring the phylogenetic trees used in the node weighting procedure:

$ wgd ksd -o ./ -n 8 ./ath.mcl ath.cds.fasta

(3) Run I-ADHoRe and get an anchor-point K_S distribution, as well as dotplots. Here we need a structural annotation in GFF format (see e.g. https://bioinformatics.psb.ugent.be/plaza/versions/plaza_v4_dicots/download/ for examples of such files):

$ wgd syn ath.gff ath.mcl -ks ath.mcl.ks.tsv -f gene -a ID

(4) Fit Gaussian mixture models with 1 to 5 components (default parameters):

$ wgd mix ath.mcl.ks.tsv -n 1 5

For more information on mixture modeling and some cautionary notes, refer to A note on mixture models for KS distributions.

(5) Explore the full and anchor distribution with kernel density estimates interactively. First run a bokeh server in the background:

$ bokeh serve &

Next, execute the following command:

$ wgd viz -i -ks ath.mcl.ks.tsv,ath.mcl.ks_anchors.tsv -l full,anchors

A tab should open in your default browser. See Visualization module for more information on visualization with wgd viz.
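
The non-interactive steps (1) to (4) can also be scripted. The following is a minimal sketch, assuming the same file names as in the example above. The helpers build_workflow and run_workflow and the dry_run switch are hypothetical and not part of wgd; they only assemble the commands shown above and, by default, print them instead of running them.

```python
import shlex
import subprocess


def build_workflow(cds="ath.cds.fasta", gff="ath.gff", n_threads=8):
    """Return the example commands as argument lists (hypothetical helper)."""
    # File names are copied from the example above, not derived by wgd's rules.
    families = "ath.mcl"
    ks_file = "ath.mcl.ks.tsv"
    threads = str(n_threads)
    return [
        ["wgd", "mcl", "--cds", "--mcl", "-s", cds, "-o", "./", "-n", threads],
        ["wgd", "ksd", "-o", "./", "-n", threads, "./" + families, cds],
        ["wgd", "syn", gff, families, "-ks", ks_file, "-f", "gene", "-a", "ID"],
        ["wgd", "mix", ks_file, "-n", "1", "5"],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing step
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_workflow(build_workflow())
```

Passing the commands as argument lists avoids shell quoting problems with unusual file names; setting dry_run=False would actually run each step and requires wgd and its dependencies to be installed.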

Reference

wgd_cli.blast_mcl(cds=True, mcl=True, one_v_one=False, sequences=None, species_ids=None, blast_results=None, inflation_factor=2.0, eval_cutoff=1e-10, output_dir='wgd_blast', n_threads=4)

All vs. all Blast + MCL pipeline. For usage in the wgd CLI. Can be used to perform all vs. all Blast, MCL clustering and one vs. one ortholog delineation.

Parameters:

cds – boolean, indicates that the provided sequences are CDS
mcl – boolean, perform MCL clustering
one_v_one – boolean, infer one vs. one orthologs by reciprocal best hits (True) or a paranome (False).
sequences – CDS fasta files, if multiple (for one vs. one ortholog identification), then as a comma-separated string e.g. ath.fasta,aly.fasta
species_ids – comma-separated species ids, optional for one-vs-one ortholog delineation (will prefix the sequence IDs in that case).
blast_results – precomputed blast results (tab separated blast output style)
inflation_factor – inflation factor for MCL clustering
eval_cutoff – e-value cut off for blastp analysis
output_dir – output directory
n_threads – number of threads to use

Returns:

output file name
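
The reciprocal best hit idea behind one_v_one can be sketched as follows. This is a generic illustration, not the wgd implementation. It assumes tab-separated BLAST rows with the query in the first column, the subject in the second and the bit score in the last column, and it assumes the rows contain only between-species hits. The helper names are hypothetical.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject (score in last column)."""
    best = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject, score = fields[0], fields[1], float(fields[-1])
        if query == subject:
            continue  # skip self hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(rows):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    best = best_hits(rows)
    pairs = set()
    for query, subject in best.items():
        if best.get(subject) == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)
```

A pair is reported only when both genes pick each other, so a gene whose best hit prefers another partner yields no ortholog pair.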

wgd_cli.kde_(ks_distribution, filters, ks_range, bandwidth, bins, output_file)

Fit a KDE to a Ks distribution.

Parameters:

ks_distribution – Ks distribution file
filters – alignment filters
ks_range – Ks range
bandwidth – bandwidth
bins – number of histogram bins
output_file – output file

Returns:

nada
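
To make the roles of bandwidth and ks_range concrete, here is a minimal sketch of a plain Gaussian kernel density estimate using only the standard library. It is a generic illustration: the estimator, any boundary handling and any weighting that wgd actually applies are not documented here, and the function names gaussian_kde and ks_grid are hypothetical.

```python
import math


def gaussian_kde(values, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    if not values:
        raise ValueError("need at least one value")
    # Normalization so that the estimate integrates to one
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:
        total = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        density.append(norm * total)
    return density


def ks_grid(ks_range, n_points=200):
    """Evenly spaced evaluation points spanning ks_range = (low, high)."""
    low, high = ks_range
    step = (high - low) / (n_points - 1)
    return [low + i * step for i in range(n_points)]
```

A larger bandwidth gives a smoother curve that may merge nearby peaks; a smaller one follows the data more closely but is noisier.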

wgd_cli.ksd_(gene_families, sequences, output_directory, protein_sequences=None, tmp_dir=None, aligner='muscle', codeml='codeml', times=1, min_msa_length=100, ignore_prefixes=False, one_v_one=False, pairwise=False, preserve=False, n_threads=4, weighting_method='fasttree', max_pairwise=10000, **kwargs)

Ks distribution construction pipeline. For usage in the wgd CLI.

Parameters:

gene_families – gene families, i.e. tab separated paralogs or one-vs-one orthologs (see blast_mcl())
sequences – CDS fasta files, if multiple (for constructing one-vs.-one ortholog distribution) then as a comma separated string
output_directory – output directory
protein_sequences – protein sequences (optional), by default CDS files are translated using the standard genetic code.
tmp_dir – tmp directory name (optional)
aligner – aligner to use
codeml – path to codeml executable
times – number of times to iteratively perform ML estimation of Ks, Ka and omega values.
min_msa_length – minimum multiple sequence alignment length
ignore_prefixes – ignore prefixes defined by ‘|’ in gene IDs
one_v_one – boolean, one-vs.-one ortholog analysis
pairwise – run in pairwise mode
preserve – boolean, preserve codeml output files, multiple sequence alignments and trees?
async – use the async library for parallelization (not recommended)
n_threads – number of threads to use
weighting_method – weighting method (fasttree, phyml or alc)
max_pairwise – maximum number of pairwise combinations a gene family may have. This effectively filters out families of size n where n*(n-1)/2 exceeds max_pairwise.

Returns:

output file name
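
The max_pairwise criterion described above can be sketched as follows. This is an illustration of the stated rule only, assuming each gene family is a list of gene IDs; the helper names are hypothetical and the actual filtering code in wgd may differ.

```python
def n_pairs(family_size):
    """Number of pairwise combinations in a family of `family_size` genes."""
    return family_size * (family_size - 1) // 2


def filter_families(families, max_pairwise=10000):
    """Keep families whose number of gene pairs does not exceed max_pairwise."""
    return [fam for fam in families if n_pairs(len(fam)) <= max_pairwise]
```

With the default of 10000, a family of 141 genes (9870 pairs) is kept, while a family of 142 genes (10011 pairs) is filtered out.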

wgd_cli.mix_(ks_distribution, filters, ks_range, method, components, bins, output_dir, gamma, n_init, max_iter)

Mixture modeling tools.

Note that histogram weighting is done after applying the specified filters. Also note that mixture models are fitted to node-averaged (not weighted) histograms. Please interpret mixture model results with caution; for more information, refer to A note on mixture models for KS distributions.

Parameters:

ks_distribution – Ks distribution data frame
filters – alignment stats filters
ks_range – Ks range used for models
method – mixture modeling method, Bayesian/ordinary Gaussian mixtures
components – number of components to use (tuple: (min, max))
bins – number of histogram bins for visualization
output_dir – output directory
gamma – gamma parameter for BGMM
n_init – number of k-means initializations (best is kept)
max_iter – number of iterations

Returns:

nada

wgd_cli.syn_(gff_file, families, output_dir, ks_distribution, feature='mRNA', gene_attribute='Parent', min_length=250, ks_range=(0.05, 5), **kwargs)

Co-linearity analysis with I-ADHoRe 3.0. For usage in the wgd CLI.

Parameters:

gff_file – GFF3 annotation file (see the annotation files on PLAZA as an example)
families – gene families as tab separated gene IDs, see blast_mcl()
output_dir – output directory
ks_distribution – Ks distribution tsv file, see ksd_()
feature – keyword for entities of interest in the GFF file, e.g. ‘CDS’ or ‘mRNA’
gene_attribute – attribute key for the gene ID in the GFF (9th column), e.g. ‘ID’ or ‘Parent’

Returns:

nothing at all
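
How feature and gene_attribute select gene IDs from a GFF file can be sketched as follows. This is a simplified illustration, not the parser used by wgd: it assumes standard nine-column, tab-separated GFF3 rows with key=value pairs separated by semicolons in the 9th column, and the helper names are hypothetical.

```python
def parse_attributes(column9):
    """Parse the 9th GFF3 column ('key=value;key=value') into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def gene_ids(gff_lines, feature="mRNA", gene_attribute="Parent"):
    """Collect gene IDs from rows whose type (3rd column) equals `feature`."""
    ids = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature:
            continue  # skip malformed rows and other feature types
        value = parse_attributes(columns[8]).get(gene_attribute)
        if value is not None:
            ids.append(value)
    return ids
```

For the IDs to match those in the gene families file, the chosen combination has to yield the same identifiers; for instance, rows of type gene with the ID attribute, or rows of type mRNA with the Parent attribute, depending on the annotation.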

wgd_cli.viz_(ks_distributions, alpha_values, colors, labels, hist_type, title, output_file, filters, ks_range, bins, interactive=False, weighted=False)

Plot (stacked) histograms (interactively). Add option to plot node-weighted histograms in the same fashion.

Parameters:

ks_distributions – a directory with ks distributions (other files are ignored) or a comma-separated string of file names
alpha_values – alpha values for the different distributions (in the same order). Only relevant for non-interactive visualization.
colors – as in alpha_values but for colors
labels – as in alpha_values but for legend labels (by default the file names are used). Unlike alpha_values and colors, this is also relevant for the interactive bokeh visualization.
hist_type – histogram type (matplotlib), either ‘barstacked’, ‘step’ or ‘stepfilled’.
title – plot title
output_file – output file name
interactive – render an interactive bokeh plot. This makes some of the above arguments redundant

Returns:

nada
