.. _methods:

==========================================================
Some informal notes on the methods implemented in ``wgd``
==========================================================

:Author: Arthur Zwaenepoel

.. contents::
   :depth: 3

Introduction
============

``wgd`` is a Python package and command line suite for analyzing ancient
whole genome duplications (WGDs) in genomic data sets. The main focus of
``wgd`` is on the construction of whole paranome (the collection of all
duplicate genes in a genome) |Ks| distributions, but it also provides tools
for the inference of intragenomic synteny and anchor points, as well as
downstream analyses and visualizations of whole paranome |Ks| distributions.

On this page we provide some background information and informal explanations
of the main functionality and methods, namely the inference of these
so-called |Ks| distributions. This is not a tutorial on how to use ``wgd``;
for that, please have a look at the :ref:`command line interface ` or at the
supplementary material of our Bioinformatics paper.

What is a |Ks| distribution?
============================

|Ks|, also called d\ *S*, denotes the synonymous distance between two
protein-coding DNA sequences (CDS), where synonymous distance means the
expected number of synonymous substitutions per synonymous site. A |Ks|
estimate for any two coding sequences can be acquired in two main ways:
(1) by means of heuristic counting methods such as the method of Nei &
Gojobori (1986), or (2) by using Markov models of codon substitution with
maximum likelihood (ML) estimation algorithms. The latter methods are most
commonly used nowadays, and trace back to the pioneering work of Goldman &
Yang (1994) and Muse & Gaut (1994). For more information on the estimation
of pairwise synonymous distances, we refer to the wonderful textbooks on
molecular evolution authored by Professor Ziheng Yang (e.g. Chapter 2 in
Yang 2006).

Since synonymous substitutions are presumed to have largely negligible
effects on fitness, they are commonly regarded as neutral. We therefore
expect the synonymous distance between two protein-coding sequences to
increase with time in a stochastic, clock-like fashion. It is therefore
assumed that |Ks| can serve as a proxy for time, and that a larger |Ks|
value indicates an older divergence time between two sequences.

A whole paranome |Ks| distribution is nothing more than the distribution of
all |Ks| estimates for all inferred duplication events in the genome of some
species. It therefore serves as a proxy for the distribution of the
divergence times of all duplication events that have left a trace in the
genome of interest. Under the assumption of a constant gene duplication rate
and a constant rate of loss of duplicated copies, such a distribution is
expected to show an exponential decay shape (Lynch & Conery 2000, Lynch
2007), which is the result of a quasi-equilibrium linear birth-death
process.

A WGD event entails the duplication of all gene copies in the genome at some
point in time (*i.e.* the polyploidization event) and subsequent massive
loss or divergence of duplicated copies in a relatively short time frame
(*i.e.* the rediploidization phase). This process of WGD followed by
rediploidization is expected to leave a large number of duplication events
with similar divergence times (and hence similar |Ks| values) that can be
traced back in the extant genome of interest. As a result, a WGD is expected
to leave a peak signature in the |Ks| distribution that cannot be explained
by constant-rate small-scale duplication as in the model of Lynch & Conery.
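As a rough illustration (this is not part of ``wgd`` itself), the following
minimal Python sketch simulates such a mixed |Ks| distribution: an
exponential background of small-scale duplicates plus a WGD component with
a positively skewed shape (here log-normal, see the note below); all rates
and sizes are arbitrary::

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)

    # Background: constant-rate duplication and loss yields an exponential
    # decay of Ks values (cf. Lynch & Conery 2000); the scale is arbitrary.
    background = rng.exponential(scale=1.0, size=20000)

    # WGD: a burst of duplicates with similar divergence times, modeled
    # here as a log-normal component around Ks ~ 1.5 (hypothetical values).
    wgd = rng.lognormal(mean=np.log(1.5), sigma=0.15, size=4000)

    ks = np.concatenate([background, wgd])
    ks = ks[ks < 5]  # restrict to an interpretable Ks range

    plt.hist(ks, bins=100, color="gray")
    plt.xlabel("$K_S$")
    plt.ylabel("number of duplicate gene pairs")
    plt.show()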
As a side note, if we assume that synonymous substitution is a Poisson
process, and that the synonymous substitution rate is Gamma distributed
across gene families, this WGD peak signature is expected to follow a
negative binomial distribution in the distribution of *numbers of synonymous
substitutions* (which is not the same as the |Ks| distribution). In the
|Ks| distribution, this would translate to a peak with positive skew, which
we could approximate by a Gamma or log-normal distribution. These
considerations are important for mixture modeling. An *ad hoc* illustration
of such a hypothetical |Ks| distribution is shown below. The left plot shows
the full |Ks| distribution, whereas the right plot shows the decomposition
into the background signature from the continuous birth, death and fixation
of small-scale duplicates (gray) and the WGD signature (in blue, here
modeled as a log-normal distribution).

.. figure:: adhoc.png

How is the |Ks| distribution computed in ``wgd``?
=================================================

While the basic idea seems simple enough, constructing a |Ks| distribution
is a rather involved business. This is the main reason why we released
``wgd``: to make this seemingly simple approach actually simple to perform.
Below we elaborate on the main steps undertaken in the construction of a
|Ks| distribution.

Inferring duplicates
--------------------

First, we need to infer the duplicate genes in the genome. To this end,
commonly employed homolog clustering approaches are used in ``wgd mcl``.
Since we are, however, not interested in multi-species gene families, these
homolog clustering methods are used only within a particular genome of
interest. ``wgd mcl`` clusters genes into paralogous families (where all
genes are paralogous to each other, *i.e.* diverged after some duplication
event) by

1. Doing an all-\ *versus*-all Blastp similarity search
2. Constructing a sequence similarity graph based on the Blastp results
3. Finding clusters in the graph by means of an unsupervised approach using
   the Markov clustering algorithm (MCL, Van Dongen 2000).

For more information on these approaches, please consult the literature on
homolog clustering (e.g. OrthoMCL, OrthoFinder), but bear in mind that we
are only concerned with within-genome homolog clusters, *i.e.* paralogous
gene families.

Calculating *pairwise* Ks values
--------------------------------

For every paralogous family we inferred, we first construct a multiple
sequence alignment (MSA) using standard practices (in ``wgd ksd`` you can
choose between MUSCLE, MAFFT and PRANK for alignment). Alignments are
constructed at the amino acid level, as these can be more reliable, and are
backtranslated to nucleotide sequences to obtain a codon alignment.

Now that we have our paralogous gene families and MSAs, we can compute for
every pair in a family the synonymous distance, |Ks|. To this end ``wgd
ksd`` uses the ``codeml`` program from the PAML package. ``codeml`` has tons
of parameter settings and algorithmic options; those used in ``wgd ksd``
will not be reiterated here, but can be found in the supplement to
Zwaenepoel & Van de Peer (2018) or Vanneste *et al.* (2013). In general, the
settings are chosen such that pairwise |Ks| values are estimated reliably by
maximum likelihood.
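To give a flavour of what happens under the hood, below is a minimal,
simplified sketch of a single pairwise ``codeml`` run driven from Python
(the actual settings used by ``wgd ksd`` are those in the references above;
``pair.paml`` is a hypothetical codon alignment of two sequences)::

    import subprocess

    # A minimal codeml control file for pairwise ML estimation (runmode -2).
    ctl = """
    seqfile = pair.paml
    outfile = pair.codeml
    runmode = -2   * pairwise comparisons
    seqtype = 1    * codon sequences
    CodonFreq = 2  * F3x4 codon frequency model
    model = 0
    NSsites = 0
    """
    with open("codeml.ctl", "w") as f:
        f.write(ctl)

    subprocess.run(["codeml", "codeml.ctl"], check=True)

    # With runmode = -2, codeml writes lower-triangular matrices with the
    # pairwise ML estimates to 2ML.dS, 2ML.dN and 2ML.t.
    with open("2ML.dS") as f:
        print(f.read())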
Calculating Ks estimates for duplication *events*
--------------------------------------------------

Note that after the previous step we have n(n-1)/2 pairwise |Ks| estimates
for a paralogous family with n genes, whereas what we really want is a |Ks|
estimate for each *duplication event* in the family. If multiple
duplications happened, we will have multiple redundant |Ks| estimates for
internal duplication nodes in the gene family tree. To see this, have a look
at the figure below, where a paralogous gene tree is shown, with two marked
internal duplication nodes and their respective |Ks| values.

.. figure:: tree.png

We are interested in the divergence times (approximated by |Ks|) of the five
internal nodes of this tree, whereas what we actually have are
6 × 5 / 2 = 15 |Ks| estimates between all pairs of leaves. As is clear from
this figure, we have 3 redundant estimates for the orange node, whereas we
have 8 redundant estimates for the green node. We can correct for this
redundancy either by averaging the |Ks| values for each node (or taking some
other summary statistic, like the median), or by computing a weight for
every |Ks| value (see also the next section).

However, to do this we of course need the tree. Therefore, after calculating
the pairwise |Ks| values, ``wgd ksd`` will compute a phylogenetic tree for
every paralogous gene family to perform this redundancy correction (either
by node-averaging or node-weighting, see below). In ``wgd ksd`` one can
choose between three options to construct this tree: average linkage
clustering of |Ks| values (which is fast but somewhat crude), approximate ML
tree inference using FastTree (Price *et al.*), or ML tree inference using
PhyML (which is obviously the slowest option). After the tree is computed,
``wgd ksd`` will write for every pair the |Ks| estimate, some alignment
statistics, the node in the tree for which the |Ks| value provides a
divergence estimate, the gene family, the various distance estimates
(|Ks|, |Ka| and ω), and the weights computed both without and with
filtering out outliers (see below).

Visualizing a |Ks| distribution
===============================

If all goes well, ``wgd ksd`` will compute a |Ks| distribution which looks
somewhat like this (where I have omitted lots of columns that are not really
of interest currently)::

                          Family         Ks  Node  WeightOutliersExcluded
    AT3G11180__AT5G05600  GF_000093  0.7679  23.0                    1.00
    AT2G38240__AT5G05600  GF_000093  1.9122  25.0                    0.33
    AT2G38240__AT3G11180  GF_000093  2.5219  25.0                    0.33
    AT5G08640__AT5G63580  GF_000093  1.3265  30.0                    0.20
    ...

Now, simply plotting a histogram of the resulting |Ks| values will of course
be flawed, since we would plot all redundant estimates with equal weight,
and since older duplication events tend to have more pairwise estimates,
this may artificially amplify the signal of old duplication events. There
are two main approaches to overcome this, already noted in the previous
section.

The first is **node-averaging**, where one computes a summarized |Ks| value
for each node. This is easily performed using built-in functions to work
with data frames in R or Python. This approach obviously throws away the
information in the individual pairwise estimates and only uses the
summarized value in subsequent analyses.

The second approach is **node-weighting**, where one computes a weight for
every pairwise estimate and adds every single estimate to the histogram,
but with a weight based on the number of redundant estimates. Such an
approach does not throw away the information in individual estimates, but
is trickier to work with in subsequent analyses such as kernel density
estimation (KDE) or Gaussian mixture modeling (GMM). Weighted histograms
can be plotted with both ``ggplot2`` in R and ``matplotlib`` in Python.
A sketch of both approaches is shown below.
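For instance, starting from the ``wgd ksd`` output table shown above,
node-averaging and node-weighting can be done along the following lines
(a sketch using ``pandas`` and ``matplotlib``; the file name
``wgd_ksd_output.tsv`` is hypothetical, the column names are those of the
example output)::

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("wgd_ksd_output.tsv", sep="\t", index_col=0)

    # Node-averaging: one summarized Ks value per duplication node, here
    # using the median of the redundant pairwise estimates.
    node_averaged = df.groupby(["Family", "Node"])["Ks"].median()

    # Node-weighting: every pairwise estimate gets weight 1/n, where n is
    # the number of retained estimates for its duplication node.
    df["weight"] = 1 / df.groupby(["Family", "Node"])["Ks"].transform("count")

    # A node-weighted Ks histogram.
    plt.hist(df["Ks"], bins=50, weights=df["weight"], color="gray")
    plt.xlabel("$K_S$")
    plt.ylabel("number of duplication events (weighted)")
    plt.show()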
It is important to be aware that different filters on the |Ks| range will
cause the weights to change. Indeed, if some pairwise estimate is filtered
out because of some filtering criterion, the number of |Ks| estimates will
change for that particular duplication event. This is exactly the difference
between the ``WeightOutliersIncluded`` and ``WeightOutliersExcluded``
columns: the weights in the latter column are based on the number of
pairwise estimates after filtering, whereas the former has weights based on
the number of pairwise estimates before filtering. It is thus important to
recompute the weights whenever different filtering criteria are used. The
tools in ``wgd`` will always do this, but it is important to keep in mind
when analyzing the data yourself in R or some other statistical/plotting
environment.

Note that ``wgd ksd`` by default outputs node-weighted histograms, whereas
node-averaged histograms can be generated using ``wgd viz``. The mixture
modeling and KDE methods in ``wgd mix`` and ``wgd kde`` use node-averaged
histograms for modeling purposes.

Downstream analyses
===================

Downstream analyses consist mainly of fitting different models to a |Ks|
distribution, such as KDEs or mixture models. A simple example of a mixture
model fit is sketched below. For more information on mixture modeling of
|Ks| distributions, we refer to :ref:`note_on_gmms`.

.. |Ks| replace:: `K`\ :sub:`S`
.. |Ka| replace:: `K`\ :sub:`A`
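As an illustration of such a downstream analysis, the following is a minimal
sketch of a Gaussian mixture model fit on log-transformed, node-averaged
|Ks| values (using ``scikit-learn``; this is not necessarily how ``wgd mix``
is implemented, and the file name is hypothetical)::

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # One node-averaged Ks value per duplication event, filtered to a
    # sensible range (e.g. 0.05 < Ks < 5) beforehand.
    ks = np.loadtxt("node_averaged_ks.txt")

    # A Gaussian component in log(Ks) space corresponds to a log-normal
    # component in Ks space (cf. the positive skew discussed above).
    X = np.log(ks).reshape(-1, 1)

    # Fit mixtures with 1 to 4 components and compare them with BIC.
    models = [GaussianMixture(n, random_state=42).fit(X) for n in range(1, 5)]
    best = min(models, key=lambda m: m.bic(X))

    print("best number of components:", best.n_components)
    for mean in best.means_.ravel():
        print("component located at Ks ~ {:.2f}".format(np.exp(mean)))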