Ks distribution module


Copyright (C) 2018 Arthur Zwaenepoel

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Contact: arzwa@psb.vib-ugent.be


wgd.ks_distribution._calculate_weighted_ks(clustering, pairwise_estimates, pairwise_distances=None, family_id=None)

Calculate weighted Ks, Kn and w values following the procedure as outlined in Vanneste et al. (2013)

Parameters:
  • clustering – clustering results as produced with _weighting(), can be by average linkage clustering or phylogenetic means (phyml, fasttree)
  • pairwise_estimates – pairwise Ks, Kn and w estimates as produced by codeml.Codeml.run_codeml()
Returns:

a pandas data frame of weighted Ks, Kn and w values.

wgd.ks_distribution._calculate_weights(clustering, pairwise_estimates, pairwise_distances=None)

This is a patch for weight calculation in the pairwise approach

Parameters:
  • clustering – clustering array
  • pairwise_estimates – pairwise Ks estimates array
  • pairwise_distances – pairwise distances array
Returns:

data frame

wgd.ks_distribution._get_nucleotide_sequences(family, nucleotide)

Get nucleotide sequences for codon alignment with PRANK

Parameters:
  • family – sequence dictionary for gene family
  • nucleotide – nucleotide sequences
Returns:

nucleotide sequences for gene family

wgd.ks_distribution._weighting(pairwise_estimates, msa=None, method='alc')

Wrapper for different weighting methods. The fastest method is average linkage clustering based on Ks values, other methods use phylogenetic trees (phyml or fasttree)

Parameters:
  • pairwise_estimates – results dictionary from codeml.Codeml.run() output
  • msa – protein multiple sequence alignment file path (optional, not necessary for method=``alc``)
  • method – method, one of alc|phyml|fasttree
Returns:

clustering data structure and pairwise leaf distances (not if method=``alc``)

wgd.ks_distribution.add_alignment_stats(df, stats, l, l_stripped)

Add alignment statistics to the data frame

Parameters:
  • df – pandas data frame
  • stats – stats dict see alignment.pairwise_alignment_stats()
  • l – alignment length
  • l_stripped – stripped alignment length
Returns:

data frame

wgd.ks_distribution.analyse_family(family_id, family, nucleotide, tmp='./', codeml='codeml', preserve=False, times=1, min_length=100, method='alc', aligner='muscle', output_dir='./out', **kwargs)

Wrapper function for the analysis of one paralog family. Performs alignment with alignment.MSA.run_aligner() and codeml analysis with codeml.Codeml.run_codeml(). Subsequently also clustering with _average_linkage_clustering() is performed and weighted Ks, Kn and w values are calculated using _calculate_weighted_ks().

Parameters:
  • family_id – gene family id
  • family – dictionary with sequences of paralogs
  • nucleotide – nucleotide (CDS) sequences dictionary
  • tmp – tmp directory
  • codeml – codeml path
  • preserve – preserve intermediate files
  • times – number of times to perform ML estimation of Ks, Ka and omega values
  • method – weighting method, from fast to slow: alc, fasttree, phyml
  • aligner – alignment program
  • output_dir – output directory
Returns:

csv file with results for the paralog family of interest

wgd.ks_distribution.analyse_family_pairwise(family_id, family, nucleotide, tmp='./', codeml='codeml', preserve=False, times=1, min_length=100, method='alc', aligner='muscle', output_dir='./out')

Perform Ks analysis for one gene family using teh pairwise approach. The pairwise analysis approach is the one most commonly used in the literature, e.g. Vanneste et al. (2013), Vanneste et al. (2015), Barker et al. (2008), Li et al. (2015), Maere et al. (2005) and many more. Implementation details differ for many publications, and their are several implementation choices that are non-trivial. A main point of importance is the weighting of multiple pairwise Ks estimates based on a phylogeny (or proxy thereof), with some authors that do not weight (naive pairwise Ks distributions). Some authors use phylogenetic trees, oter use hierarchical clustering for the weighting. Lastly some authors perform reweighting after outlier removal, others don’t. Here the approach of Vanneste et al. (2013) and later papers is largely followed:

(1) A multiple sequence alignment at the protein level is constructed for a full paralogous family.

(2) For every pair of sequences in the multiple sequence alignment (i.e. n*(n-1)/2 pairs for a family of n members), strip all gaps in the pairwise alignment.

(3) Then back-translate the sequences to obtain a codon alignment. Discard the sequence pair if the alignment is too short (min_length, default=100).

(4) Use codeml to perform ML estimation of Ks, Ka and Ka/Ks. Here the empirical codon frequencies of the pairwise alignment with the F3x4 method.

(5) A matrix is constructed with all pairwise estimates. Entries that do not have a valid Ks estimate (because the alignment was too short) are removed from the matrix. This is done by iteratively removing the columns (and rows) with most NaN values. These sequences are also removed from the aligment.

(6) A weighting method is applied. Either a phylogenetic tree is constructed for the family or the Pairwise Ks matrix is clustered using average linkage clustering. Every Ks estimate for a duplication node is then weighted by the number of estimates that are present for that node such that the different estimates for the same duplication event result in weight of 1 in total.

(7) Outliers are removed, where outliers are naively defined as pairs with Ks estimates > 5. Outlier removal essentially consists in putting the weight of these pairs to 0 and recalculating the weights for the remaining pairs in the family.

Parameters:
  • family_id – gene family ID
  • family – gene family sequence dictionary
  • nucleotide – nucleotide sequences
  • tmp – tmp directory
  • codeml – codeml executable
  • preserve – preserve codeml, alignment and tree results
  • times – number of times to perform codeml estimation
  • min_length – minimum gap-stripped alignment length to consider
  • method – weighting method
  • aligner – alignment method
  • output_dir – output directory
Returns:

nada

wgd.ks_distribution.ks_analysis_one_vs_one(nucleotide_sequences, protein_sequences, gene_families, tmp_dir='./tmp', output_dir='./ks.out', codeml_path='codeml', aligner='muscle', preserve=True, times=1, n_threads=4, **kwargs)

Calculate a Ks distribution for one vs. one orthologs.

Parameters:
  • nucleotide_sequences – sequence dictionary
  • protein_sequences – protein sequence dictionary
  • paralogs – file with paralog families
  • tmp_dir – tmp directory
  • output_dir – output directory
  • codeml_path – path to codeml executable
  • preserve – preserve intermediate results (muscle, codeml)
  • times – number of times to perform codeml analysis
  • ignore_prefixes – ignore prefixes in paralog/gene family file (e.g. in ath|AT1G45000, ath| will be ignored)
  • min_length – minimum MSA length
  • aligner – aligner to use (muslce|prank)
  • n_threads – number of CPU cores to use
Returns:

data frame

wgd.ks_distribution.ks_analysis_paranome(nucleotide_sequences, protein_sequences, paralogs, tmp_dir='./tmp', output_dir='./ks.out', codeml_path='codeml', preserve=True, times=1, ignore_prefixes=False, n_threads=4, min_length=100, method='alc', aligner='muscle', pairwise=False, max_pairwise=10000, **kwargs)

Calculate a Ks distribution for a whole paranome.

Parameters:
  • nucleotide_sequences – sequence dictionary
  • protein_sequences – protein sequence dictionary
  • paralogs – file with paralog families
  • tmp_dir – tmp directory
  • output_dir – output directory
  • codeml_path – path to codeml executable
  • preserve – preserve intermediate results (muscle, codeml)
  • times – number of times to perform codeml analysis
  • ignore_prefixes – ignore prefixes in paralog/gene family file (e.g. in ath|AT1G45000, ath| will be ignored)
  • min_length – minimum MSA length
  • method – method to use, from fast to slow: alc, fasttree, phyml
  • aligner – alignment program to use (muscle|prank)
  • n_threads – number of CPU cores to use:
  • pairwise – perform pairwise (instead of gene family wise) analysis
  • max_pairwise – maximum number of pairwise combinations a family can have.
Returns:

data frame

wgd.ks_distribution.sort_families_by_size(families, pairwise=False, max_pairwise=10000)

Returns a list of non-singleton gene families ordered by size and apply some filters

Parameters:
  • families – nested gene family dictionary {family: {gene: sequence}}
  • pairwise – pairwise analysis
  • max_pairwise – maximum number of pairwise combinations a family may have
Returns:

list of tuples [(family id, size)] sorted by size