Ks distribution module¶

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Contact: arzwa@psb.vib-ugent.be

wgd.ks_distribution._calculate_weighted_ks(clustering, pairwise_estimates, pairwise_distances=None, family_id=None)¶

Calculate weighted Ks, Kn and w values following the procedure as outlined in Vanneste et al. (2013)

Parameters:	clustering – clustering results as produced with `_weighting()`, can be by average linkage clustering or phylogenetic means (phyml, fasttree) pairwise_estimates – pairwise Ks, Kn and w estimates as produced by `codeml.Codeml.run_codeml()`
Returns:	a pandas data frame of weighted Ks, Kn and w values.

wgd.ks_distribution._calculate_weights(clustering, pairwise_estimates, pairwise_distances=None)¶

This is a patch for weight calculation in the pairwise approach

Parameters:	clustering – clustering array pairwise_estimates – pairwise Ks estimates array pairwise_distances – pairwise distances array
Returns:	data frame

wgd.ks_distribution._get_nucleotide_sequences(family, nucleotide)¶

Get nucleotide sequences for codon alignment with PRANK

Parameters:	family – sequence dictionary for gene family nucleotide – nucleotide sequences
Returns:	nucleotide sequences for gene family

wgd.ks_distribution._weighting(pairwise_estimates, msa=None, method='alc')¶

Wrapper for different weighting methods. The fastest method is average linkage clustering based on Ks values, other methods use phylogenetic trees (phyml or fasttree)

Parameters:	pairwise_estimates – results dictionary from `codeml.Codeml.run()` output msa – protein multiple sequence alignment file path (optional, not necessary for method=``alc``) method – method, one of `alc\|phyml\|fasttree`
Returns:	clustering data structure and pairwise leaf distances (not if method=``alc``)

wgd.ks_distribution.add_alignment_stats(df, stats, l, l_stripped)¶

Add alignment statistics to the data frame

Parameters:	df – pandas data frame stats – stats dict see `alignment.pairwise_alignment_stats()` l – alignment length l_stripped – stripped alignment length
Returns:	data frame

wgd.ks_distribution.analyse_family(family_id, family, nucleotide, tmp='./', codeml='codeml', preserve=False, times=1, min_length=100, method='alc', aligner='muscle', output_dir='./out', **kwargs)¶

Wrapper function for the analysis of one paralog family. Performs alignment with alignment.MSA.run_aligner() and codeml analysis with codeml.Codeml.run_codeml(). Subsequently also clustering with _average_linkage_clustering() is performed and weighted Ks, Kn and w values are calculated using _calculate_weighted_ks().

Parameters:

family_id – gene family id
family – dictionary with sequences of paralogs
nucleotide – nucleotide (CDS) sequences dictionary
tmp – tmp directory
codeml – codeml path
preserve – preserve intermediate files
times – number of times to perform ML estimation of Ks, Ka and omega values
method – weighting method, from fast to slow: alc, fasttree, phyml
aligner – alignment program
output_dir – output directory

Returns:

csv file with results for the paralog family of interest

wgd.ks_distribution.analyse_family_pairwise(family_id, family, nucleotide, tmp='./', codeml='codeml', preserve=False, times=1, min_length=100, method='alc', aligner='muscle', output_dir='./out')¶

Perform Ks analysis for one gene family using teh pairwise approach. The pairwise analysis approach is the one most commonly used in the literature, e.g. Vanneste et al. (2013), Vanneste et al. (2015), Barker et al. (2008), Li et al. (2015), Maere et al. (2005) and many more. Implementation details differ for many publications, and their are several implementation choices that are non-trivial. A main point of importance is the weighting of multiple pairwise Ks estimates based on a phylogeny (or proxy thereof), with some authors that do not weight (naive pairwise Ks distributions). Some authors use phylogenetic trees, oter use hierarchical clustering for the weighting. Lastly some authors perform reweighting after outlier removal, others don’t. Here the approach of Vanneste et al. (2013) and later papers is largely followed:

(1) A multiple sequence alignment at the protein level is constructed for a full paralogous family.

(2) For every pair of sequences in the multiple sequence alignment (i.e. n*(n-1)/2 pairs for a family of n members), strip all gaps in the pairwise alignment.

(3) Then back-translate the sequences to obtain a codon alignment. Discard the sequence pair if the alignment is too short (min_length, default=100).

(4) Use codeml to perform ML estimation of Ks, Ka and Ka/Ks. Here the empirical codon frequencies of the pairwise alignment with the F3x4 method.

(5) A matrix is constructed with all pairwise estimates. Entries that do not have a valid Ks estimate (because the alignment was too short) are removed from the matrix. This is done by iteratively removing the columns (and rows) with most NaN values. These sequences are also removed from the aligment.

(6) A weighting method is applied. Either a phylogenetic tree is constructed for the family or the Pairwise Ks matrix is clustered using average linkage clustering. Every Ks estimate for a duplication node is then weighted by the number of estimates that are present for that node such that the different estimates for the same duplication event result in weight of 1 in total.

(7) Outliers are removed, where outliers are naively defined as pairs with Ks estimates > 5. Outlier removal essentially consists in putting the weight of these pairs to 0 and recalculating the weights for the remaining pairs in the family.

Parameters:

family_id – gene family ID
family – gene family sequence dictionary
nucleotide – nucleotide sequences
tmp – tmp directory
codeml – codeml executable
preserve – preserve codeml, alignment and tree results
times – number of times to perform codeml estimation
min_length – minimum gap-stripped alignment length to consider
method – weighting method
aligner – alignment method
output_dir – output directory

Returns:

nada

wgd.ks_distribution.ks_analysis_one_vs_one(nucleotide_sequences, protein_sequences, gene_families, tmp_dir='./tmp', output_dir='./ks.out', codeml_path='codeml', aligner='muscle', preserve=True, times=1, n_threads=4, **kwargs)¶

Calculate a Ks distribution for one vs. one orthologs.

Parameters:

nucleotide_sequences – sequence dictionary
protein_sequences – protein sequence dictionary
paralogs – file with paralog families
tmp_dir – tmp directory
output_dir – output directory
codeml_path – path to codeml executable
preserve – preserve intermediate results (muscle, codeml)
times – number of times to perform codeml analysis
ignore_prefixes – ignore prefixes in paralog/gene family file (e.g. in ath|AT1G45000, ath| will be ignored)
min_length – minimum MSA length
aligner – aligner to use (muslce|prank)
n_threads – number of CPU cores to use

Returns:

data frame

wgd.ks_distribution.ks_analysis_paranome(nucleotide_sequences, protein_sequences, paralogs, tmp_dir='./tmp', output_dir='./ks.out', codeml_path='codeml', preserve=True, times=1, ignore_prefixes=False, n_threads=4, min_length=100, method='alc', aligner='muscle', pairwise=False, max_pairwise=10000, **kwargs)¶

Calculate a Ks distribution for a whole paranome.

Parameters:

nucleotide_sequences – sequence dictionary
protein_sequences – protein sequence dictionary
paralogs – file with paralog families
tmp_dir – tmp directory
output_dir – output directory
codeml_path – path to codeml executable
preserve – preserve intermediate results (muscle, codeml)
times – number of times to perform codeml analysis
ignore_prefixes – ignore prefixes in paralog/gene family file (e.g. in ath|AT1G45000, ath| will be ignored)
min_length – minimum MSA length
method – method to use, from fast to slow: alc, fasttree, phyml
aligner – alignment program to use (muscle|prank)
n_threads – number of CPU cores to use:
pairwise – perform pairwise (instead of gene family wise) analysis
max_pairwise – maximum number of pairwise combinations a family can have.

Returns:

data frame

wgd.ks_distribution.sort_families_by_size(families, pairwise=False, max_pairwise=10000)¶

Returns a list of non-singleton gene families ordered by size and apply some filters

Parameters:	families – nested gene family dictionary {family: {gene: sequence}} pairwise – pairwise analysis max_pairwise – maximum number of pairwise combinations a family may have
Returns:	list of tuples [(family id, size)] sorted by size

Ks distribution module¶

Previous topic

Next topic

This Page