Mixture modeling
****************

The mixture modeling features of `wgd` use the ``scikit-learn`` (better known
as ``sklearn``) library.

.. _note_on_gmms:

A note on mixture models for |Ks| distributions
===============================================

Mixture models have frequently been employed to study WGDs with |Ks|
distributions. Under several basic molecular evolutionary assumptions, the
peak in the |Ks| distribution caused by a WGD is expected to show a
distribution with positive skew, which can be reasonably well approximated by
a log-normal distribution. Fitting mixtures of log-normal components and
statistically evaluating the different model fits is therefore a reasonable
strategy to locate WGD-derived peaks in a |Ks| distribution. However, mixture
models are known to be prone to overfitting and overclustering, and this is
especially true for |Ks| distributions, where we have a lot of data points
(see e.g. Tiley `et al.` (2018) for a recent study on mixture models for WGD
inference). Therefore, **we do not advise using mixture models as formal
statistical tests of multiple WGD hypotheses**. We mainly regard mixture
models as providing a somewhat more formal representation of what a researcher
understands as a 'peak' giving evidence for a WGD. Additionally, mixture
models allow us to obtain estimates of the mean and variance of hypothesized
WGD peaks. Mixture models also provide a quantitative means of selecting
paralogous pairs derived from a hypothesized WGD. That is, given a fitted
mixture model which we regard as representing our hypothesis of ancient WGDs,
we can isolate those gene pairs that belong with, say, 95% probability to a
given component under the model. This is likely preferable to applying
arbitrary cut-offs based on visual inspection.

A note on the practical difference between the BGMM and GMM method
===================================================================

For algorithmic and theoretical details, we refer to
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.mixture.
Here we give a pragmatic description. **Again we stress that mixture modeling
results should not be taken as evidence for WGD as such, and should also be
interpreted with caution (see above)!**

The GMM method
--------------

When using the GMM method, model selection proceeds by evaluating relative
model fit using the Bayesian or Akaike information criterion ([B/A]IC).
Comparison of AIC values across models can be used to assess which model fits
the data best according to the AIC. As an example, consider the output of
``wgd mix``::

    AIC assessment:
    min(AIC) = 25247.22 for model 4
    Relative probabilities compared to model 4:
        /                            \
        |        (min(AIC) - AICi)/2 |
        | p = e                      |
        \                            /
    .. model 1: p = 0.0000
    .. model 2: p = 0.0000
    .. model 3: p = 0.0000
    .. model 4: p = 1.0000
    .. model 5: p = 0.0005

The `p` computed by the formula shown in this output can be interpreted as
proportional to the probability that model `i` minimizes the information loss.
More specifically, in this example model 5 is 0.0005 times as probable as
model 4 to minimize the expected information loss. In this case the AIC
clearly supports the 4 component model.
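To give a concrete idea of what happens under the hood, the sketch below
reproduces this kind of AIC comparison directly with scikit-learn. Note that
this is only an illustrative sketch, not the actual ``wgd mix``
implementation: the input file name, the |Ks| filtering range and the
log-transformation of the values are assumptions made for the example::

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical input: a plain text file with one Ks estimate per line
    # (e.g. extracted from a Ks distribution results table).
    ks = np.loadtxt("ks_values.txt")

    # Restrict to an informative Ks range and log-transform, since the
    # components are assumed to be (approximately) log-normal.
    X = np.log(ks[(ks > 0.01) & (ks < 5)]).reshape(-1, 1)

    # Fit mixtures with 1 to 5 components and record their AIC values.
    models = [GaussianMixture(n_components=k, random_state=42).fit(X)
              for k in range(1, 6)]
    aic = np.array([m.aic(X) for m in models])

    # Relative probability that model i minimizes the information loss:
    # p_i = exp((min(AIC) - AIC_i) / 2)
    rel_prob = np.exp((aic.min() - aic) / 2)
    for k, (a, p) in enumerate(zip(aic, rel_prob), start=1):
        print(f"model {k}: AIC = {a:.2f}, p = {p:.4f}")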
The BIC-based model selection procedure is analogous. For every model fit we
calculate the BIC value and record its difference with the minimum BIC value
(the Delta BIC value). If we interpret Delta BIC values as Bayes factors, we
can again perform model selection::

    Delta BIC assessment:
    min(BIC) = 25327.22 for model 4
    .. model 1: delta(BIC) = 3970.57 ( >10: Very Strong)
    .. model 2: delta(BIC) = 1758.68 ( >10: Very Strong)
    .. model 3: delta(BIC) = 38.39 ( >10: Very Strong)
    .. model 4: delta(BIC) = 0.00 (0 to 2: Very weak)
    .. model 5: delta(BIC) = 37.17 ( >10: Very Strong)

Here ``( >10: Very Strong)`` denotes very strong support of model 4 over the
model in question. These results confirm the conclusions based on the AIC
values.

The BGMM method
---------------

The BGMM method uses a variational Bayes algorithm to fit infinite Gaussian
mixture models with a Dirichlet process (DP) prior on the mixture components.
This is a Bayesian nonparametric clustering approach and does not, in
principle, require the number of components to be fixed *a priori*. In theory,
for an infinite GMM with a DP prior, the more data points one has, the more
clusters one should obtain, so the method is not really geared towards
determining the 'true' number of components in the distribution. However, the
method can also be regarded from a regularization perspective, where a prior
distribution on the component weights is used to constrain model flexibility.
That is, for particular choices of the hyperparameter governing the DP prior
(denoted as gamma), the model fitting procedure will allow more or fewer
components to be active in the mixture. For low values of gamma, the model
fitting procedure effectively penalizes the number of high-weight components
in the mixture (a minimal sketch of such a fit with scikit-learn is given at
the end of this page). In `wgd` we provide plots of the mixture with the
associated weight of each component, such that the user can visually discern
whether a component is active or inactive (negligible weight) in the mixture.
For example, using the same distribution as in the previous paragraph, a
mixture with 5 components looks like this:

.. image:: bgmms.svg.png

Here we see that the fifth component, with mean 0.07, has negligible weight
compared to the other components in the mixture, which agrees with the results
above.

Reference
=========

.. automodule:: wgd.modeling
    :members:
    :private-members:
    :special-members: __init__

.. |Ks| replace:: K\ :sub:`S`
.. |Ka| replace:: K\ :sub:`A`
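As referred to in the BGMM section above, a variational Bayesian mixture with
a Dirichlet process prior can also be fitted directly with scikit-learn. The
following is only a minimal sketch under the same assumptions as the GMM
example above (input file, |Ks| filtering and log-transformation); the chosen
value of the concentration prior (gamma) is likewise an assumption made for
illustration::

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Same hypothetical input and preprocessing as in the GMM sketch above.
    ks = np.loadtxt("ks_values.txt")
    X = np.log(ks[(ks > 0.01) & (ks < 5)]).reshape(-1, 1)

    # Variational Bayesian GMM with a Dirichlet process prior on the weights.
    # `weight_concentration_prior` plays the role of the gamma hyperparameter:
    # low values push superfluous components towards negligible weight.
    bgmm = BayesianGaussianMixture(
        n_components=5,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.001,  # assumed low gamma for illustration
        max_iter=1000,
        random_state=42,
    ).fit(X)

    # Components with negligible weight can be regarded as inactive.
    for k, (w, mu) in enumerate(zip(bgmm.weights_, bgmm.means_.ravel()), start=1):
        print(f"component {k}: weight = {w:.3f}, mean (log Ks) = {mu:.3f}")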