Command line interface
Copyright (C) 2018 Arthur Zwaenepoel
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Contact: arzwa@psb.vib-ugent.be
The command line interface (CLI) is the main way to interact with the wgd package. Each CLI command is a Click command that wraps a function with the same name followed by an underscore (for example, the ksd command wraps ksd_()). This design was chosen mainly so that the pipeline commands can reuse code from other subcommands.
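The wrapping pattern described above can be sketched as follows. This is a minimal illustration and not the wgd source: it uses the standard library's argparse in place of Click so that it runs without extra dependencies, and the function bodies, the `demo` program name and the returned file names are all hypothetical.

```python
import argparse


def mcl_(sequences, n_threads=4):
    """Hypothetical stand-in for the function wrapped by an `mcl` command."""
    return f"{sequences}.mcl"


def ksd_(gene_families, sequences, n_threads=4):
    """Hypothetical stand-in for the function wrapped by a `ksd` command."""
    return f"{gene_families}.ks.tsv"


def wf1_(sequences, n_threads=4):
    """A pipeline calls the underscore functions directly, not the CLI layer."""
    families = mcl_(sequences, n_threads=n_threads)
    return ksd_(families, sequences, n_threads=n_threads)


# Command name -> wrapped function (same name plus a trailing underscore).
COMMANDS = {"mcl": mcl_, "wf1": wf1_}


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        cmd = sub.add_parser(name)
        cmd.add_argument("sequences")
        cmd.add_argument("-n", "--n_threads", type=int, default=4)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return COMMANDS[args.command](args.sequences, n_threads=args.n_threads)
```

Because `wf1_` only depends on plain functions, the same code path serves both the individual subcommand and the pipeline.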
Upon successful installation you should be able to run:
$ wgd -h
Usage: wgd [OPTIONS] COMMAND [ARGS]...
Welcome to the wgd command line interface!
_______
\ ___ `'.
_ _ .--./) ' |--.\ \
/\ \ ///.''\ | | \ '
`\ //\\ //| | | | | | | '
\`// \'/ \`-' / | | | |
\| |/ /("'` | | ' .'
' \ '---. | |___.' /'
/'""'.\ /_______.'/
|| ||\_______|/
\'. __//
`'---'
wgd Copyright (C) 2018 Arthur Zwaenepoel
This program comes with ABSOLUTELY NO WARRANTY;
This is free software, and you are welcome to redistribute it
under certain conditions;
Contact: arzwa@psb.vib-ugent.be
Options:
-v, --verbosity [info|debug] Verbosity level, default = info.
-l, --logfile TEXT File to write logs to (optional)
-h, --help Show this message and exit.
Commands:
kde Fit a KDE to a Ks distribution.
ksd Ks distribution construction.
mcl All-vs.-all blastp + MCL analysis.
mix Mixture modeling of Ks distributions.
syn Co-linearity analyses.
viz Plot histograms/densities (interactively).
wf1 Standard workflow whole paranome Ks.
wf2 Standard workflow one-vs-one ortholog Ks.
Note that the verbosity option is set before the subcommand:
$ wgd --verbosity [info|debug] [COMMAND]
This sets the verbosity level of the logging.
All commands are equipped with usage instructions and documentation of options
and arguments, which can be viewed with the --help or -h flag. These should be
largely self-explanatory; for further details, refer to the documentation of
the specific functions that are called, which can be found on this page (e.g.
the function called by wgd ksd is wgd_cli.ksd_()).
For more information on the methods used in wgd to compute Ks distributions,
please refer to Some informal notes on the methods implemented in wgd.
Example
Here is a small example of how to use the package through the CLI. This is a
workflow for constructing a Ks distribution from a fasta file with CDS
sequences called ath.cds.fasta. Your file names may differ, but the steps are
the same.
(1) Get the paranome, i.e. perform all-against-all Blastp and MCL clustering. Note that the -n option specifies the use of 8 threads:
$ wgd mcl --cds --mcl -s ath.cds.fasta -o ./ -n 8
(2) Construct a Ks distribution, using FastTree (the default weighting method) to infer the phylogenetic trees used in the node weighting procedure:
$ wgd ksd -o ./ -n 8 ./ath.mcl ath.cds.fasta
(3) Run I-ADHoRe to obtain an anchor-point Ks distribution as well as dotplots. This requires a structural annotation in GFF format (see e.g. https://bioinformatics.psb.ugent.be/plaza/versions/plaza_v4_dicots/download/ for examples of such files):
$ wgd syn ath.gff ath.mcl -ks ath.mcl.ks.tsv -f gene -a ID
(4) Fit Gaussian mixture models with 1 to 5 components (default parameters):
$ wgd mix ath.mcl.ks.tsv -n 1 5
For more information on mixture modeling and some cautionary notes, refer to A note on mixture models for Ks distributions.
(5) Explore the full and anchor distributions interactively with kernel density estimates. First run a bokeh server in the background:
$ bokeh serve &
Next, execute the following command:
$ wgd viz -i -ks ath.mcl.ks.tsv,ath.mcl.ks_anchors.tsv -l full,anchors
A tab should open in your default browser. See Visualization module for more
information on visualization with wgd viz.
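Steps (1) to (4) above can also be chained from a script. The following is a rough sketch that only uses the commands and file names shown in this example; the function names `build_commands` and `run_workflow` are hypothetical, and the interactive step (5) is left out because it needs a running bokeh server. By default the commands are only printed, not executed.

```python
import shlex
import subprocess


def build_commands(cds="ath.cds.fasta", gff="ath.gff", threads=8):
    """Return the example's commands as argument lists.

    Intermediate file names (ath.mcl, ath.mcl.ks.tsv) are copied literally
    from the example above and are not derived from `cds`.
    """
    n = str(threads)
    return [
        ["wgd", "mcl", "--cds", "--mcl", "-s", cds, "-o", "./", "-n", n],
        ["wgd", "ksd", "-o", "./", "-n", n, "./ath.mcl", cds],
        ["wgd", "syn", gff, "ath.mcl", "-ks", "ath.mcl.ks.tsv",
         "-f", "gene", "-a", "ID"],
        ["wgd", "mix", "ath.mcl.ks.tsv", "-n", "1", "5"],
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; run it too when dry_run is False."""
    rendered = []
    for cmd in commands:
        line = shlex.join(cmd)
        print(line)
        rendered.append(line)
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    run_workflow(build_commands())
```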
Reference
wgd_cli.blast_mcl(cds=True, mcl=True, one_v_one=False, sequences=None, species_ids=None, blast_results=None, inflation_factor=2.0, eval_cutoff=1e-10, output_dir='wgd_blast', n_threads=4)
All vs. all Blast + MCL pipeline. For usage in the wgd CLI. Can be used to perform all vs. all Blast, MCL clustering and one vs. one ortholog delineation.
Parameters:
- cds – boolean, indicates that the provided sequences are CDS
- mcl – boolean, perform MCL clustering
- one_v_one – boolean, whether to infer one vs. one orthologs by reciprocal best hits (True) or a paranome (False).
- sequences – CDS fasta files, if multiple (for one vs. one ortholog identification), then as a comma-separated string e.g. ath.fasta,aly.fasta
- species_ids – comma-separated species ids, optional for one-vs-one ortholog delineation (will prefix the sequence IDs in that case).
- blast_results – precomputed blast results (tab separated blast output style)
- inflation_factor – inflation factor for MCL clustering
- eval_cutoff – e-value cut off for blastp analysis
- output_dir – output directory
- n_threads – number of threads to use
Returns: output file name
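The reciprocal best hit criterion mentioned for one_v_one can be illustrated with a short sketch. It is not the wgd implementation: it assumes the input holds only between-species hits in the standard 12-column BLAST tabular layout (e-value in column 11, bit score in column 12), ranks hits by bit score, and the function names are hypothetical.

```python
def parse_blast_tab(lines, eval_cutoff=1e-10):
    """Yield (query, subject, bitscore) for hits passing the e-value cutoff."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= eval_cutoff:
            yield fields[0], fields[1], bitscore


def best_hits(hits):
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best = {}
    for query, subject, bitscore in hits:
        if query == subject:
            continue  # ignore self hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(lines, eval_cutoff=1e-10):
    """Return sorted gene pairs that are each other's best hit."""
    best = best_hits(parse_blast_tab(lines, eval_cutoff))
    pairs = {
        tuple(sorted((query, subject)))
        for query, subject in best.items()
        if best.get(subject) == query
    }
    return sorted(pairs)
```

A gene whose best hit points elsewhere (for example a second paralog hitting the same ortholog) is dropped, which is what makes the resulting pairs one-vs-one.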
wgd_cli.kde_(ks_distribution, filters, ks_range, bandwidth, bins, output_file)
Fit a KDE to a Ks distribution.
Parameters:
- ks_distribution – Ks distribution file
- filters – alignment filters
- ks_range – Ks range
- bandwidth – bandwidth
- bins – number of histogram bins
- output_file – output file
Returns: nada
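To make the roles of ks_range and bandwidth concrete, here is a small Gaussian kernel density estimate written with the standard library only. It is a sketch under stated assumptions, not the estimator wgd uses: the default range and bandwidth, the grid size and the function names are hypothetical, and no alignment filters or weights are applied.

```python
import math


def gaussian_kde(values, bandwidth, grid):
    """Evaluate a Gaussian KDE of `values` at every point in `grid`."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def kde_on_range(ks_values, ks_range=(0.05, 5.0), bandwidth=0.1, n_points=200):
    """Keep Ks values inside ks_range and estimate their density on a grid."""
    low, high = ks_range
    kept = [ks for ks in ks_values if low <= ks <= high]
    if not kept:
        raise ValueError("no Ks values inside ks_range")
    step = (high - low) / (n_points - 1)
    grid = [low + i * step for i in range(n_points)]
    return grid, gaussian_kde(kept, bandwidth, grid)
```

A smaller bandwidth follows the data more closely at the cost of a noisier curve, so it is worth comparing a few values before reading peaks off the density.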
wgd_cli.ksd_(gene_families, sequences, output_directory, protein_sequences=None, tmp_dir=None, aligner='muscle', codeml='codeml', times=1, min_msa_length=100, ignore_prefixes=False, one_v_one=False, pairwise=False, preserve=False, n_threads=4, weighting_method='fasttree', max_pairwise=10000, **kwargs)
Ks distribution construction pipeline. For usage in the wgd CLI.
Parameters:
- gene_families – gene families, i.e. tab separated paralogs or one-vs-one orthologs (see blast_mcl())
- sequences – CDS fasta files, if multiple (for constructing one-vs.-one ortholog distribution) then as a comma separated string
- output_directory – output directory
- protein_sequences – protein sequences (optional), by default CDS files are translated using the standard genetic code.
- tmp_dir – tmp directory name (optional)
- aligner – aligner to use
- codeml – path to codeml executable
- times – number of times to iteratively perform ML estimation of Ks, Ka and omega values.
- min_msa_length – minimum multiple sequence alignment length
- ignore_prefixes – ignore prefixes defined by ‘|’ in gene IDs
- one_v_one – boolean, one-vs.-one ortholog analysis
- pairwise – run in pairwise mode
- preserve – boolean, preserve codeml output files, multiple sequence alignments and trees?
- async – use the async library for parallelization (not recommended)
- n_threads – number of threads to use
- weighting_method – weighting method (fasttree, phyml or alc)
- max_pairwise – maximum number of pairwise combinations a gene family may have. This effectively filters out families of size n where n*(n-1)/2 exceeds max_pairwise.
Returns: output file name
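The effect of max_pairwise can be checked with a few lines of arithmetic. The sketch below assumes the gene family file holds one family per line with tab separated gene IDs, as described for gene_families; the helper names are hypothetical and this is not the wgd code.

```python
def n_pairs(family_size):
    """Number of pairwise combinations in a family of the given size."""
    return family_size * (family_size - 1) // 2


def read_families(lines):
    """Parse one family per line, gene IDs separated by tabs."""
    return [line.rstrip("\n").split("\t") for line in lines if line.strip()]


def filter_families(families, max_pairwise=10000):
    """Keep families whose number of gene pairs does not exceed max_pairwise."""
    return [fam for fam in families if n_pairs(len(fam)) <= max_pairwise]
```

With the default of 10000, a family of 141 genes (9870 pairs) is kept, while a family of 142 genes (10011 pairs) is filtered out.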
wgd_cli.mix_(ks_distribution, filters, ks_range, method, components, bins, output_dir, gamma, n_init, max_iter)
Mixture modeling tools.
Note that histogram weighting is done after applying specified filters. Also note that mixture models are fitted to node-averaged (not weighted) histograms. Please interpret mixture model results with caution; for more info, refer to A note on mixture models for Ks distributions.
Parameters:
- ks_distribution – Ks distribution data frame
- filters – alignment stats filters
- ks_range – Ks range used for models
- method – mixture modeling method, Bayesian/ordinary Gaussian mixtures
- components – number of components to use (tuple: (min, max))
- bins – number of histogram bins for visualization
- output_dir – output directory
- gamma – gamma parameter for BGMM
- n_init – number of k-means initializations (best is kept)
- max_iter – number of iterations
Returns: nada
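Fitting mixtures over a (min, max) range of components and comparing them can be sketched with a plain expectation-maximization loop. This is only an illustration of the general technique and makes several assumptions that differ from the description above: it uses a deterministic quantile initialization instead of k-means restarts, it compares models by BIC, it has no Bayesian variant, and all function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_gmm(data, k, max_iter=200, tol=1e-8, min_sigma=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, sigmas, loglik), where loglik is the
    log-likelihood computed in the last E-step.
    """
    xs = sorted(data)
    n = len(xs)
    # Initial means at evenly spaced quantiles; shared initial spread.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    centre = sum(xs) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in xs) / n)
    sigmas = [max(spread, min_sigma)] * k
    weights = [1.0 / k] * k
    previous, loglik = -math.inf, -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point.
        resp, loglik = [], 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sigmas)]
            total = sum(dens) or 1e-300
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)
        if abs(loglik - previous) < tol:
            break
        previous = loglik
    return weights, means, sigmas, loglik


def bic(loglik, k, n):
    """BIC with 3k - 1 free parameters (k means, k sigmas, k - 1 weights)."""
    return (3 * k - 1) * math.log(n) - 2.0 * loglik


def fit_range(data, components=(1, 5)):
    """Fit every model in the inclusive (min, max) range; return {k: BIC}."""
    k_min, k_max = components
    return {k: bic(fit_gmm(data, k)[3], k, len(data))
            for k in range(k_min, k_max + 1)}
```

A lower BIC favors a model, but as noted above the number of components should not be read as a count of biological events without further evidence.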
wgd_cli.syn_(gff_file, families, output_dir, ks_distribution, feature='mRNA', gene_attribute='Parent', min_length=250, ks_range=(0.05, 5), **kwargs)
Co-linearity analysis with I-ADHoRe 3.0. For usage in the wgd CLI.
Parameters:
- gff_file – GFF3 annotation file (see the annotation files on PLAZA as an example)
- families – gene families as tab separated gene IDs, see blast_mcl()
- output_dir – output directory
- ks_distribution – Ks distribution tsv file, see ksd_()
- feature – keyword for entities of interest in the GFF file, e.g. ‘CDS’ or ‘mRNA’
- gene_attribute – attribute key for the gene ID in the GFF (9th column), e.g. ‘ID’ or ‘Parent’
Returns: nothing at all
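How feature and gene_attribute select gene IDs from a GFF3 file can be shown with a small parser. This is a sketch that assumes standard nine-column, tab separated GFF3 lines with `key=value` attributes separated by semicolons; the function names are hypothetical and the example is independent of the wgd code.

```python
def parse_attributes(column9):
    """Split the 9th GFF3 column into a {key: value} dictionary."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def gene_positions(gff_lines, feature="mRNA", gene_attribute="Parent"):
    """Yield (seqid, gene_id, start, end, strand) for matching features."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 3 is the feature type, column 9 holds the attributes.
        if len(cols) != 9 or cols[2] != feature:
            continue
        gene_id = parse_attributes(cols[8]).get(gene_attribute)
        if gene_id is not None:
            yield cols[0], gene_id, int(cols[3]), int(cols[4]), cols[6]
```

The gene IDs obtained this way have to match the IDs used in the gene family file, which is why step (3) of the example passes `-f gene -a ID` rather than relying on the defaults.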
wgd_cli.viz_(ks_distributions, alpha_values, colors, labels, hist_type, title, output_file, filters, ks_range, bins, interactive=False, weighted=False)
Plot (stacked) histograms (interactively). Add option to plot node-weighted histograms in the same fashion.
Parameters:
- ks_distributions – a directory with Ks distributions (other files are ignored) or a comma-separated string of file names
- alpha_values – alpha values for the different distributions (in the same order). Only relevant for non-interactive visualization.
- colors – as in alpha_values but for colors
- labels – as in alpha_values but for legend labels (by default the file names are used). This is also relevant for the interactive bokeh visualization (as opposed to alpha_values and colors).
- hist_type – histogram type (matplotlib), either ‘barstacked’, ‘step’ or ‘stepfilled’.
- title – plot title
- output_file – output file name
- interactive – render an interactive bokeh plot. This makes some of the above arguments redundant
Returns: nada