Various utilities¶
Probably not of specific interest.
Copyright (C) 2018 Arthur Zwaenepoel
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Contact: arzwa@psb.vib-ugent.be
Every package has it’s utils module, right?
-
class
wgd.utils.
Genome
¶ Class that represents a structural annotation. Collects several nice data structures for a genome and parsers for various genomic data file formats (e.g. gff, fasta, …)
-
__init__
()¶ Genome.genome: dictionary with a full representation Genome.gene_lists: dictionary with ordered lists of genes per chromosome (as for I-ADHoRe)
-
karyotype_json
(out_file='genome.json')¶ Generate karyotype data file in json format (as per Circos.js/d3.js)
Parameters: out_file – output file name
-
parse_plaza_gff
(gff_file, keyword='mRNA', id_string='Parent')¶ Parse a PLAZA annotation file into a genome dictionary
Parameters: - gff_file – input gff (PLAZA style)
- keyword – keyword for elements to parse out
- id_string – keyword for retrieving the gene ID from the 9th column
-
-
wgd.utils.
_random_color
()¶ Generate a random hex color
-
wgd.utils.
can_i_run_software
(software)¶ Test if external software is executable
Parameters: software – list or string of executable(s) Returns: 1 (failure) or 0 (success)
-
wgd.utils.
check_dirs
(tmp_dir, output_dir, prompt, preserve)¶ Check directories needed
Parameters: - tmp_dir – tmp directory
- output_dir – output directory
- prompt – prompt for overwrites (boolean)?
- preserve – preserve MSA files (boolean)?
Returns: nothing
-
wgd.utils.
filter_one_vs_one_families
(gene_families, s1, s2)¶ Filter one-vs-one ortholog containing families for two given species.
Parameters: - gene_families – gene families fil in raw MCL format, see
process_gene_families()
- s1 – species 1 prefix
- s2 – species 2 prefix
Returns: one-vs.-one ortholog containing gene families.
- gene_families – gene families fil in raw MCL format, see
-
class
wgd.utils.
gaussian_kde
(dataset, bw_method=None, weights=None)¶ from: https://stackoverflow.com/questions/27623919/weighted-gaussian-kernel- density-estimation-in-python
Representation of a kernel-density estimate using Gaussian kernels.
Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
- dataset : array_like
- Datapoints to estimate from. In case of univariate data this is a 1-D array, otherwise a 2-D array with shape (# of dims, # of data).
- bw_method : str, scalar or callable, optional
- The method used to calculate the estimator bandwidth. This can be ‘scott’, ‘silverman’, a scalar constant or a callable. If a scalar, this will be used directly as kde.factor. If a callable, it should take a gaussian_kde instance as only parameter and return a scalar. If None (default), ‘scott’ is used. See Notes for more details.
- weights : array_like, shape (n, ), optional, default: None
- An array of weights, of the same shape as x. Each value in x only contributes its associated weight towards the bin count (instead of 1).
- dataset : ndarray
- The dataset with which gaussian_kde was initialized.
- d : int
- Number of dimensions.
- n : int
- Number of datapoints.
- neff : float
- Effective sample size using Kish’s approximation.
- factor : float
- The bandwidth factor, obtained from kde.covariance_factor, with which the covariance matrix is multiplied.
- covariance : ndarray
- The covariance matrix of dataset, scaled by the calculated bandwidth (kde.factor).
- inv_cov : ndarray
- The inverse of covariance.
- kde.evaluate(points) : ndarray
- Evaluate the estimated pdf on a provided set of points.
- kde(points) : ndarray
- Same as kde.evaluate(points)
- kde.pdf(points) : ndarray
- Alias for
kde.evaluate(points)
. - kde.set_bandwidth(bw_method=’scott’) : None
- Computes the bandwidth, i.e. the coefficient that multiplies the data covariance matrix to obtain the kernel covariance matrix. .. versionadded:: 0.11.0
- kde.covariance_factor : float
- Computes the coefficient (kde.factor) that multiplies the data covariance matrix to obtain the kernel covariance matrix. The default is scotts_factor. A subclass can overwrite this method to provide a different method, or set it through a call to kde.set_bandwidth.
Bandwidth selection strongly influences the estimate obtained from the KDE (much more so than the actual shape of the kernel). Bandwidth selection can be done by a “rule of thumb”, by cross-validation, by “plug-in methods” or by other means; see [3], [4] for reviews. gaussian_kde uses a rule of thumb, the default is Scott’s Rule.
Scott’s Rule [1], implemented as scotts_factor, is:
n**(-1./(d+4)),
with
n
the number of data points andd
the number of dimensions. Silverman’s Rule [2], implemented as silverman_factor, is:(n * (d + 2) / 4.)**(-1. / (d + 4)).
Good general descriptions of kernel density estimation can be found in [1] and [2], the mathematics for this multi-dimensional implementation can be found in [1].
[1] (1, 2, 3) D.W. Scott, “Multivariate Density Estimation: Theory, Practice, and Visualization”, John Wiley & Sons, New York, Chicester, 1992. [2] (1, 2) B.W. Silverman, “Density Estimation for Statistics and Data Analysis”, Vol. 26, Monographs on Statistics and Applied Probability, Chapman and Hall, London, 1986. [3] B.A. Turlach, “Bandwidth Selection in Kernel Density Estimation: A Review”, CORE and Institut de Statistique, Vol. 19, pp. 1-33, 1993. [4] D.M. Bashtannyk and R.J. Hyndman, “Bandwidth selection for kernel conditional density estimation”, Computational Statistics & Data Analysis, Vol. 36, pp. 279-298, 2001. Generate some random two-dimensional data:
>>> from scipy import stats >>> def measure(n): >>> "Measurement model, return two coupled measurements." >>> m1 = np.random.normal(size=n) >>> m2 = np.random.normal(scale=0.5, size=n) >>> return m1+m2, m1-m2
>>> m1, m2 = measure(2000) >>> xmin = m1.min() >>> xmax = m1.max() >>> ymin = m2.min() >>> ymax = m2.max()
Perform a kernel density estimate on the data:
>>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j] >>> positions = np.vstack([X.ravel(), Y.ravel()]) >>> values = np.vstack([m1, m2]) >>> kernel = stats.gaussian_kde(values) >>> Z = np.reshape(kernel(positions).T, X.shape)
Plot the results:
>>> import matplotlib.pyplot as plt >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r, ... extent=[xmin, xmax, ymin, ymax]) >>> ax.plot(m1, m2, 'k.', markersize=2) >>> ax.set_xlim([xmin, xmax]) >>> ax.set_ylim([ymin, ymax]) >>> plt.show()
-
__init__
(dataset, bw_method=None, weights=None)¶ Initialize self. See help(type(self)) for accurate signature.
-
_compute_covariance
()¶ Computes the covariance matrix for each Gaussian kernel using covariance_factor().
-
evaluate
(points)¶ Evaluate the estimated pdf on a set of points.
- points : (# of dimensions, # of points)-array
- Alternatively, a (# of dimensions,) vector can be passed in and treated as a single point.
- values : (# of points,)-array
- The values at each point.
- ValueError : if the dimensionality of the input points is different than
- the dimensionality of the KDE.
-
set_bandwidth
(bw_method=None)¶ Compute the estimator bandwidth with given method.
The new bandwidth calculated after a call to set_bandwidth is used for subsequent evaluations of the estimated density.
- bw_method : str, scalar or callable, optional
- The method used to calculate the estimator bandwidth. This can be ‘scott’, ‘silverman’, a scalar constant or a callable. If a scalar, this will be used directly as kde.factor. If a callable, it should take a gaussian_kde instance as only parameter and return a scalar. If None (default), nothing happens; the current kde.covariance_factor method is kept.
New in version 0.11.
>>> x1 = np.array([-7, -5, 1, 4, 5.]) >>> kde = stats.gaussian_kde(x1) >>> xs = np.linspace(-10, 10, num=50) >>> y1 = kde(xs) >>> kde.set_bandwidth(bw_method='silverman') >>> y2 = kde(xs) >>> kde.set_bandwidth(bw_method=kde.factor / 3.) >>> y3 = kde(xs)
>>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> ax.plot(x1, np.ones(x1.shape) / (4. * x1.size), 'bo', ... label='Data points (rescaled)') >>> ax.plot(xs, y1, label='Scott (default)') >>> ax.plot(xs, y2, label='Silverman') >>> ax.plot(xs, y3, label='Const (1/3 * Silverman)') >>> ax.legend() >>> plt.show()
-
wgd.utils.
get_gfs_for_species
(gene_family_dict, gene_pattern)¶ Get non-singleton gene families for a species of interest
Parameters: - gene_family_dict – gene family dictionary
- species – species of interest
Returns: dictionairy with gene families
-
wgd.utils.
get_paralogs_fasta
(input_fasta, selected_paralogs, output_fasta, pairs=False)¶ Get the fasta file associated with the paralogs in a slice from a Ks distribution data frame
Parameters: - input_fasta – fasta file
- selected_paralogs – Data frame (slice)
- output_fasta – output fasta file
Returns: nada
-
wgd.utils.
get_sequences
(paralog_dict, sequences)¶ Fetch sequences from a fasta file or sequence dict and put them in a two level dictionairy {gene_family: {gene: seq, gene: seq, …}, …}
Parameters: paralog_dict – Returns: two-level dictionairy
-
wgd.utils.
log_subprocess
(program, process)¶ Log output from a subprocess call to debug log stream
Parameters: - program – program name
- process – completed subprocess object
-
wgd.utils.
process_gene_families
(gene_family_file, ignore_prefix=False)¶ Processes a raw gene family file as e.g. from MCL output into a generic dictionary structure. MCL raw file consists of one gene family per line, including tab separated gene IDs, (without gene family ID !).
Example:
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10 gene11 gene12
Parameters: - gene_family_file – file in the right raw gene family format (see above)
- ignore_prefix – ignore prefixes (boolean), if the gene contains a ‘|’
character in default mode the part preceding the ‘|’ is trimmed of,
assuming the second part is the gene ID. If
ignore prefix = True
this behavior is suppressed. So withignore_prefix = False
ath|AT1G10000
becomesAT1G10000
after processing, withignore_prefix = True
it remainsath|AT1G10000
.
-
wgd.utils.
read_fasta
(fasta_file, prefix=None, split_on_pipe=False, split_on_whitespace=True, raw=False)¶ Generic fasta file reader. Returns a dictionairy
{ID: sequence, ID: sequence, ...}
.Parameters: - fasta_file – fasta file
- prefix – prefix to add to gene IDs
- split_on_pipe – boolean, split gene IDs on ‘|’ character
- split_on_whitespace – boolean, split gene IDs on whitespace
- raw – boolean, return raw fasta file (why would you want this?)
Returns: sequence dictionary
-
wgd.utils.
translate_cds
(sequence_dict, skip_invalid=False)¶ Just another CDS to protein translater. Will give warnings when in-frame stop codons are found, invalid codons are found, or when the sequence length is not a multiple of three. Will translate the full sequence or until an unspecified or in-frame codon is encountered.
Parameters: - sequence_dict – dictionary with gene IDs and CDS sequences
- skip_invalid – bool, skip invalid CDS? (default translates to first stop codon or end)
Returns: dictionary with gene IDs and proteins sequences
-
wgd.utils.
uniq_id
()¶ Get a unique ID which is not crazy long
Returns: string
-
wgd.utils.
write_fasta
(seq_dict, output_file)¶ Write a sequence dictionary to a fasta file.
Parameters: - seq_dict – sequence dictionary, see
read_fasta()
- output_file – output file name
- seq_dict – sequence dictionary, see