Phylogenetic tools module


Copyright (C) 2018 Arthur Zwaenepoel

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Contact: arzwa@psb.vib-ugent.be


Methods for phylogenetic tree construction and processing. They mainly serve the node-weighting approach used to construct whole-paranome Ks distributions. Weighting is currently supported through average linkage clustering, FastTree ML trees and PhyML trees.

wgd.phy.average_linkage_clustering(pairwise_estimates)

Perform average linkage clustering using fastcluster. The first two columns of the output contain the indices of the nodes that are joined in each step. The input nodes are labeled 0, … , N - 1, and the newly generated nodes have the labels N, … , 2N - 2. The third column contains the distance between the two nodes at each step, i.e. the current minimal distance at the time of the merge. The fourth column gives the number of points contained in each new node.

Parameters:
  • pairwise_estimates – dictionary with data frames with pairwise estimates of Ks, Ka and Ka/Ks (or at least Ks), as returned by analyse_family().
Returns:

average linkage clustering as performed with fastcluster.average.

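The four-column output described above can be illustrated with a minimal pure-Python sketch of average linkage. This is not the fastcluster implementation that the module calls. The function name average_linkage and the toy distance matrix are hypothetical, and the naive algorithm is only meant to show how node labels, merge distances and member counts arise.

```python
def average_linkage(dist):
    """Naive average linkage on a symmetric distance matrix (list of lists).

    Returns one row [node_a, node_b, distance, size] per merge, using the
    same labeling as described above: leaves are 0..N-1, new nodes N..2N-2.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # active node label -> leaf indices
    merges = []
    next_label = n
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Average of all leaf-to-leaf distances between the two nodes.
                total = sum(dist[x][y] for x in clusters[a] for y in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[next_label] = merged
        merges.append([a, b, d, len(merged)])
        next_label += 1
    return merges


# Toy pairwise Ks distances for three genes (made-up values).
ks = [
    [0.0, 0.25, 1.0],
    [0.25, 0.0, 1.5],
    [1.0, 1.5, 0.0],
]
merges = average_linkage(ks)
# merges == [[0, 1, 0.25, 2], [2, 3, 1.25, 3]]
```

Genes 0 and 1 are joined first into node 3; node 3 is then joined with gene 2 at the mean of the two remaining distances.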
wgd.phy.phylogenetic_tree_to_cluster_format(tree, pairwise_estimates)

Convert a phylogenetic tree to a ‘cluster’ data structure as in fastcluster. The first two columns indicate the nodes that are joined by the relevant node, the third indicates the distance (calculated from branch lengths in the case of a phylogenetic tree) and the fourth the number of leaves underneath the node. Note that the trees are rooted using midpoint-rooting.

Example of the data structure (output from fastcluster):

[[   3.            7.            4.26269776    2.        ]
 [   0.            5.           26.75703595    2.        ]
 [   2.            8.           56.16007598    2.        ]
 [   9.           12.           78.91813609    3.        ]
 [   1.           11.           87.91756528    3.        ]
 [   4.            6.           93.04790855    2.        ]
 [  14.           15.          114.71302639    5.        ]
 [  13.           16.          137.94616373    8.        ]
 [  10.           17.          157.29055403   10.        ]]
Parameters:
  • tree – Newick tree file
  • pairwise_estimates – pairwise Ks estimates data frame (pandas) (only the index is used)
Returns:

clustering data structure, pairwise distances dictionary
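To make the example matrix above easier to read, the following sketch recovers the set of leaves underneath every internal node. The helper name leaves_under_nodes is hypothetical and not part of wgd.phy; the rows are copied from the example, which has N = 10 leaves and therefore internal nodes 10 to 18.

```python
EXAMPLE = [
    [3, 7, 4.26269776, 2],
    [0, 5, 26.75703595, 2],
    [2, 8, 56.16007598, 2],
    [9, 12, 78.91813609, 3],
    [1, 11, 87.91756528, 3],
    [4, 6, 93.04790855, 2],
    [14, 15, 114.71302639, 5],
    [13, 16, 137.94616373, 8],
    [10, 17, 157.29055403, 10],
]


def leaves_under_nodes(linkage, n_leaves):
    """Map every node label to the sorted list of leaves underneath it."""
    members = {i: [i] for i in range(n_leaves)}
    for step, (left, right, _dist, count) in enumerate(linkage):
        node = n_leaves + step  # row k defines node N + k
        members[node] = sorted(members[int(left)] + members[int(right)])
        if len(members[node]) != int(count):
            raise ValueError("fourth column disagrees with the leaf count")
    return members


members = leaves_under_nodes(EXAMPLE, n_leaves=10)
# members[13] == [2, 8, 9]          (row 4 joins leaf 9 with node 12)
# members[16] == [0, 1, 4, 5, 6]    (row 7 joins nodes 14 and 15)
# members[18] == [0, 1, ..., 9]     (the root contains all leaves)
```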

wgd.phy.run_fasttree(msa, fasttree_path='FastTree')

Run FastTree on a protein multiple sequence alignment.

Parameters:
  • msa – file path to protein multiple sequence alignment in multifasta format
  • fasttree_path – path to FastTree executable
Returns:

path to the tree file
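A wrapper of this kind could look roughly like the sketch below. It is not the actual wgd.phy code. It relies on two facts about FastTree: the program treats its input as a protein alignment by default, and it prints the Newick tree to standard output. The function names and the ".nw" output suffix are assumptions made for illustration.

```python
import subprocess


def fasttree_command(msa, fasttree_path="FastTree"):
    """Build the argument list for a protein alignment (FastTree's default)."""
    return [fasttree_path, msa]


def run_fasttree_sketch(msa, fasttree_path="FastTree", tree_suffix=".nw"):
    """Run FastTree and save the Newick tree it prints on stdout."""
    tree_path = msa + tree_suffix  # assumed naming scheme, not wgd's own
    with open(tree_path, "w") as handle:
        subprocess.run(
            fasttree_command(msa, fasttree_path),
            stdout=handle,
            stderr=subprocess.DEVNULL,  # FastTree reports progress on stderr
            check=True,  # raise CalledProcessError if FastTree fails
        )
    return tree_path
```

Calling run_fasttree_sketch("family.msa") requires a FastTree executable on the PATH; the file name here is only a placeholder.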

wgd.phy.run_phyml(msa, phyml_path='phyml')

Run PhyML on a protein multiple sequence alignment.

Parameters:
  • msa – file path to protein multiple sequence alignment in multifasta format
  • phyml_path – path to phyml executable
Returns:

path to the tree file

wgd.phy.write_sequential_phyml(sequence_dict, output_file)

Write a multiple sequence alignment in sequential format (e.g. for PhyML).

Parameters:
  • sequence_dict – sequence dictionary
  • output_file – filename
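As an illustration of the sequential format, here is a minimal sketch assuming that sequence_dict maps sequence identifiers to aligned sequences of equal length. The function name write_sequential_sketch is hypothetical, and the exact spacing and name handling used by wgd.phy may differ.

```python
import os
import tempfile


def write_sequential_sketch(sequence_dict, output_file):
    """Write aligned sequences in sequential PHYLIP-like format.

    The header line holds the number of sequences and the alignment length;
    each following line holds one identifier and its complete sequence.
    """
    lengths = {len(seq) for seq in sequence_dict.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    with open(output_file, "w") as handle:
        handle.write("{} {}\n".format(len(sequence_dict), lengths.pop()))
        for name, seq in sequence_dict.items():
            handle.write("{}  {}\n".format(name, seq))


# Small demonstration with made-up identifiers and sequences.
alignment = {"geneA": "MKV-LT", "geneB": "MKVALT"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "family.phy")
    write_sequential_sketch(alignment, path)
    with open(path) as handle:
        written = handle.read().splitlines()
# written == ["2 6", "geneA  MKV-LT", "geneB  MKVALT"]
```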