ugtm package

Submodules

ugtm.ugtm_classes module

Defines classes for initial and optimized GTM model.

class ugtm.ugtm_classes.InitialGTM(matX, matM, n_nodes, n_rbf_centers, rbfWidth, matPhiMPlusOne, matW, matY, betaInv, n_dimensions)

Bases: object

Class for initial GTM model.

Parameters:
  • matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
  • matM (array of shape (n_rbf_centers, 2)) – Coordinates of radial basis function (RBF) centers, defining a grid in the 2D space.
  • n_nodes (int) – The number of nodes defining a grid in the 2D space.
  • n_rbf_centers (int) – The number of radial basis function (RBF) centers.
  • rbfWidth (float) – Initial radial basis function (RBF) width. This is set to the GTM hyperparameter s (\(\sigma\)) times the average minimum distance between RBF centers: \(rbfWidth=\sigma \times average(\mathbf{distances(rbf)}_{min})\). NB: if the GTM hyperparameter s = 0 (not recommended), rbfWidth is set to the maximum distance between RBF centers.
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
  • matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
  • matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space); A point matY[:,i] is a center of Gaussian component in data space. \(\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T\)
  • betaInv (float) – Noise variance parameter for the data distribution. Written as \(\beta^{-1}\) in the original paper. Initialized to be the larger between: (1) the 3rd eigenvalue of the data covariance matrix, (2) half the average distance between Gaussian component centers in the data space (matY matrix).
  • n_dimensions (int) – Data space dimensionality (number of variables).
class ugtm.ugtm_classes.OptimizedGTM(matW, matY, matP, matR, betaInv, matMeans, matModes, matX, n_dimensions, converged)

Bases: object

Class for optimized GTM model.

Variables:
  • matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
  • matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
  • matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space). matY = np.dot(matW, np.transpose(matPhiMPlusOne))
  • matP (array of shape (n_individuals, n_nodes)) – Data distribution with variance betaInv.
  • matR (array of shape (n_individuals, n_nodes)) – Responsibilities (posterior probabilities), used to compute data representations: means (matMeans) and modes (matModes). Responsibilities are the main output of GTM. matR[i,:] represents the responsibility vector for an instance i. The columns in matR correspond to rows in matX (nodes).
  • betaInv (float) – Noise variance parameter for the data distribution. Written as \(\beta^{-1}\) in the original paper.
  • matMeans (array of shape (n_individuals, 2)) – Data representation in 2D space: means (most commonly used for GTM).
  • matModes (array of shape(n_individuals, 2)) – Data representation in 2D space: modes (for each instance, coordinate with highest responsibility).
  • n_dimensions (int) – Data space dimensionality (number of variables).
  • converged (bool) – True if the model has converged; otherwise False.
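
The attributes above can be inspected directly on an optimized model. A minimal usage sketch, assuming synthetic data and the documented module path ugtm.ugtm_gtm (runGTM() is described further below):

import numpy as np
from ugtm.ugtm_gtm import runGTM  # wrapper returning an OptimizedGTM instance

# synthetic data for illustration: 100 individuals, 10 dimensions
data = np.random.default_rng(1234).normal(size=(100, 10))

gtm = runGTM(data, k=16, m=4, s=0.3, regul=0.1)

print(gtm.matR.shape)      # (100, 256): responsibilities over the 16x16 node grid
print(gtm.matMeans.shape)  # (100, 2): mean 2D position for each individual
print(gtm.matModes.shape)  # (100, 2): coordinates of the highest-responsibility node per individual
print(gtm.converged)       # True if the EM algorithm converged
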
plot(labels=None, title='', output='output', discrete=False, pointsize=1, alpha=0.3, cname='Spectral_r', output_format='pdf')

Simple plotting function for GTM object.

Parameters:
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • title (str, optional (default = ‘’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map. Cf. https://matplotlib.org/examples/color/colormaps_reference.html
  • output_format ({‘pdf’, ‘png’, ‘ps’, ‘eps’, ‘svg’}) – Output format for GTM plot.
Returns:

Basic GTM plot: each point is the mean 2D GTM representation (mean position) of a data point.

Return type:

Image file

Notes

This function plots mean representations only (no landscape nor modes).
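
For example, continuing from an optimized model gtm (the labels and output name below are illustrative only):

import numpy as np

labels = np.random.default_rng(0).integers(0, 2, size=100)  # one discrete label per individual
gtm.plot(labels=labels, title="GTM means", output="gtm_means",
         discrete=True, pointsize=2, alpha=0.5, output_format="png")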

plot_html(labels=None, ids=None, plot_arrows=True, title='GTM', discrete=False, output='output', pointsize=1.0, alpha=0.3, do_interpolate=True, cname='Spectral_r', prior='estimated')

Plotting function for GTM object - HTML output.

Parameters:
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • ids (array of shape (n_individuals,), optional (default = None)) – Identifiers for each data point - appears in tooltips.
  • title (str, optional (default = ‘GTM’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • do_interpolate (bool, optional (default = True)) – Interpolate color between grid nodes.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior used to compute class probabilities on the map: ‘equiprobable’ assigns equal priors to all classes, while ‘estimated’ priors take class imbalance into account.
Returns:

HTML file of GTM output (mean coordinates). If labels are provided, a landscape (continuous or discrete depending on the labels) is drawn in the background. This landscape is computed using responsibilities and is indicative of the average activity or label value at a given node on the map.

Return type:

HTML file

Notes

May be time-consuming for large datasets.

plot_html_projection(projections, labels=None, ids=None, plot_arrows=True, title='GTM_projection', discrete=False, output='output', pointsize=1.0, alpha=0.3, do_interpolate=True, cname='Spectral_r', prior='estimated')

Returns a GTM landscape with projected data points - HTML output.

Parameters:
  • projections (instance of ugtm.ugtm_classes.OptimizedGTM) – Optimized GTM model of the projected (test) data, as returned by projection() or transform().
  • labels (array of shape (n_train,), optional (default = None)) – Data labels for the training set (not projections).
  • ids (array of shape (n_test,), optional (default = None)) – Identifiers for each projected data point - appears in tooltips.
  • title (str, optional (default = ‘GTM_projection’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • do_interpolate (bool, optional (default = True)) – Interpolate color between grid nodes.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior used to compute class probabilities on the map: ‘equiprobable’ assigns equal priors to all classes, while ‘estimated’ priors take class imbalance into account.
Returns:

HTML file of GTM output (mean coordinates). This function plots a GTM model (from a training set) with projected points (= mean positions of projected test data). If labels are provided, a landscape (continuous or discrete depending on the labels) is drawn in the background. This landscape is computed using responsibilities and is indicative of the average activity or label value at a given node on the map.

Return type:

HTML file

Notes

  • May be time-consuming for large datasets.
  • The labels correspond to training data (the optimized model).
  • The ids (identifiers) correspond to the test data (projections).
plot_modes(labels=None, title='', output='output', discrete=False, pointsize=1, alpha=0.3, cname='Spectral_r', output_format='pdf')

Simple plotting function for GTM object: plot modes.

Parameters:
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • title (str, optional (default = ‘’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • output_format ({‘png’, ‘pdf’, ‘ps’, ‘eps’, ‘svg’}, default = ‘pdf’) – Output format for GTM plot.
Returns:

Plot of GTM modes (for each data point, node with highest responsibility).

Return type:

Image file

Notes

This function plots mode representations only (no landscape nor means).

plot_multipanel(labels, output='output', discrete=False, pointsize=1.0, alpha=0.3, do_interpolate=True, cname='Spectral_r', prior='estimated')

Multipanel visualization for GTM object - PDF output.

Parameters:
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • do_interpolate (bool, optional (default = True)) – Interpolate color between grid nodes.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior used to compute class probabilities on the map: ‘equiprobable’ assigns equal priors to all classes, while ‘estimated’ priors take class imbalance into account.
Returns:

Four plots are returned: (1) means, (2) modes, (3) landscape with means, (4) landscape with modes.

Return type:

PDF image

write(output='output')

Write optimized GTM model: means, modes and responsibilities.

Parameters:output (str, optional (default = ‘output’)) – Output path.
Returns:Separate files for (1) means (mean position for each data point), (2) modes (node with max. responsibility for each data point), (3) responsibilities (posterior probabilities for each data point)
Return type:CSV files
write_all(output='output')

Write optimized GTM model and optimized parameters.

Parameters:output (str, optional (default = ‘output’)) – Output path.
Returns:Separate files for (1) means (mean position for each data point), (2) modes (node with max. responsibility for each data point), (3) responsibilities (posterior probabilities for each data point), (4) initial space dimension and data distribution variance, (5) manifold coordinates (matY), (6) parameter matrix (matW)
Return type:CSV files
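
For example, continuing from an optimized model gtm (the output prefix is illustrative):

gtm.write(output="gtm_out")      # CSV files: means, modes, responsibilities
gtm.write_all(output="gtm_out")  # additionally: data space dimension and variance, matY, matW
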
class ugtm.ugtm_classes.ReturnU(matU, betaInv)

Bases: object

ugtm.ugtm_core module

Core linear algebra operations for GTM and kGTM

ugtm.ugtm_core.KERNELcreateDistanceMatrix(data, matL, matPhiMPlusOne)

Computes distances between data and manifold for kernel algorithm.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • matL (array of shape (n_individuals, n_rbf_centers+1)) – Parameter matrix (regul).
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
Returns:

Matrix of distances between manifold and data.

Return type:

array of shape (n_nodes, n_individuals)

ugtm.ugtm_core.computeWidth(matM, numM, sigma)

Initializes radial basis function width using hyperparameter sigma.

Parameters:
  • matM (array of shape (n_rbf_centers, 2)) – Coordinates of radial basis function (RBF) centers, defining a grid in the 2D space.
  • numM (int) – Number of RBF centers (n_rbf_centers)
  • sigma (float) – RBF width factor.
Returns:

Initial radial basis function (RBF) width.

Return type:

float

ugtm.ugtm_core.computelogLikelihood(matP, betaInv, n_dimensions)

Computes log likelihood = GTM objective function

Parameters:
  • matP (array of shape (n_nodes, n_individuals)) – Data distribution with variance betaInv (transformed: exp(x-max(x)))
  • betaInv (float) – Noise variance parameter for the data distribution. Written as \(\beta^{-1}\) in the original paper.
  • n_dimensions (int) – Data space dimensionality (number of variables).
Returns:

Log likelihood.

Return type:

float

ugtm.ugtm_core.createDistanceMatrix(matY, data)

Computes distances between manifold centers and data vectors.

Parameters:
  • matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space);
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
Returns:

Matrix of squared Euclidean distances between manifold and data.

Return type:

array of shape (n_nodes, n_individuals)
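
A minimal numpy sketch of the quantity this function computes (a transcription of the description above, not necessarily the package's implementation):

import numpy as np

def squared_distance_matrix(matY, data):
    # matY: (n_dimensions, n_nodes), data: (n_individuals, n_dimensions)
    # matD[k, n] = squared Euclidean distance between node center matY[:, k] and data[n, :]
    diff = matY[:, :, None] - data.T[:, None, :]  # (n_dimensions, n_nodes, n_individuals)
    return (diff ** 2).sum(axis=0)                # (n_nodes, n_individuals)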

ugtm.ugtm_core.createGMatrix(matR)

Creates the G diagonal matrix from responsibilities (R)

Parameters:matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
Returns:Diagonal matrix with elements \(G_{ii}=\sum_{n}^{n\_individuals} R_{in}\).
Return type:array of shape (n_nodes, n_nodes)
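
A numpy transcription of the documented formula (a sketch, not necessarily the package's implementation):

import numpy as np

def g_matrix(matR):
    # matR: (n_nodes, n_individuals); G is diagonal with G_ii = sum_n R_in
    return np.diag(matR.sum(axis=1))  # (n_nodes, n_nodes)
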
ugtm.ugtm_core.createPMatrix(matD, betaInv, n_dimensions)

Computes data distribution matrix = exp(-(parameter)*distances).

Parameters:
  • matD (array of shape (n_nodes, n_individuals)) – Matrix of squared Euclidean distances between manifold and data.
  • betaInv (float) – Noise variance parameter for the data distribution. Written as \(\beta^{-1}\) in the original paper.
  • n_dimensions (int) – Data space dimensionality (number of variables).
Returns:

Data distribution with variance betaInv (transformed: exp(x-max(x)))

Return type:

array of shape (n_nodes, n_individuals)

Notes

Important: this data distribution is not exact (it is rescaled via the exp-normalize trick, exp(x-max(x))) and is only intended as input for createRMatrix() (responsibilities).

ugtm.ugtm_core.createPhiMatrix(matX, matM, numX, numM, sigma)

Creates matrix of RBF functions.

Parameters:
  • matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
  • matM (array of shape (n_rbf_centers, 2)) – Coordinates of radial basis function (RBF) centers, defining a grid in the 2D space.
  • numX (int) – Number of nodes (n_nodes).
  • numM (int) – Number of RBF centers (n_rbf_centers)
  • sigma (float) – RBF width factor.
Returns:

RBF matrix plus one dimension to include a term for bias.

Return type:

array of shape (n_nodes, n_rbf_centers+1)

ugtm.ugtm_core.createRMatrix(matP)

Computes responsibilities (posterior probabilities).

Parameters:matP (array of shape (n_nodes, n_individuals)) – Data distribution with variance betaInv (transformed: exp(x-max(x)))
Returns:Posterior probabilities (responsibilities).
Return type:array of shape (n_nodes, n_individuals)
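
Responsibilities are obtained by normalizing the data distribution over nodes, so that each column (one per individual) sums to one. A minimal sketch of that normalization (not necessarily the package's implementation):

import numpy as np

def responsibilities(matP):
    # matP: (n_nodes, n_individuals); normalize each column to sum to 1
    return matP / matP.sum(axis=0, keepdims=True)  # (n_nodes, n_individuals)
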
ugtm.ugtm_core.createWMatrix(matX, matPhiMPlusOne, matU, n_dimensions, n_rbf_centers)

Creates PCA-initialized parameter matrix W.

Parameters:
  • matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
  • matU (array of shape (n_dimensions, 2)) – 2 first principal axes of data covariance matrix.
  • n_dimensions (int) – Data space dimensionality (number of variables).
  • n_rbf_centers (int) – Number of RBF centers.
Returns:

Parameter matrix W (PCA-initialized).

Return type:

array of shape (n_dimensions, n_rbf_centers+1)

ugtm.ugtm_core.createYMatrix(matW, matPhiMPlusOne)

Updates manifold matrix (Y) using new parameter matrix (W).

Parameters:
  • matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
Returns:

Manifold in n-dimensional space (projection of matX in data space); A point matY[:,i] is a center of Gaussian component in data space. \(\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T\)

Return type:

array of shape (n_dimensions, n_nodes)
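
The documented formula \(\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T\) translates directly to numpy (cf. the matY description in OptimizedGTM):

import numpy as np

def y_matrix(matW, matPhiMPlusOne):
    # (n_dimensions, n_rbf_centers+1) x (n_rbf_centers+1, n_nodes) -> (n_dimensions, n_nodes)
    return np.dot(matW, np.transpose(matPhiMPlusOne))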

ugtm.ugtm_core.createYMatrixInit(data, matW, matPhiMPlusOne)

Creates initial manifold matrix (Y).

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
Returns:

Manifold in n-dimensional space (projection of matX in data space); A point matY[:,i] is a center of Gaussian component in data space. \(\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T\)

Return type:

array of shape (n_dimensions, n_nodes)

ugtm.ugtm_core.evalBetaInv(matY, betaInv, random_state=1234)

Decides which value to use for initial noise variance parameter.

Parameters:
  • matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space);
  • betaInv (float) – The 3rd eigenvalue of the data covariance matrix.
  • random_state (int, optional) – Random state used to initialize BetaInv randomly in case of bad initialization.
Returns:

Noise variance parameter for the data distribution (betaInv). Written as \(\beta^{-1}\) in the original paper. Initialized to be the larger between: (1) the 3rd eigenvalue of the data covariance matrix (function parameter), (2) half the average distance between centers of Gaussian components. In case of bad initialization (betaInv = 0), betaInv is set to a random value (a message is then printed).

Return type:

float

ugtm.ugtm_core.exp_normalize(x)

Exp-normalize trick: compute exp(x-max(x))

Parameters:x (2D array) – Input array.
Returns:y = exp(x-max(x))
Return type:2D array
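
A literal sketch of the documented trick; subtracting the maximum before exponentiating avoids numerical overflow (the package's implementation may apply it per column):

import numpy as np

def exp_normalize_sketch(x):
    # exp(x - max(x)): the largest exponent becomes 0, so np.exp cannot overflow
    return np.exp(x - np.max(x))
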
ugtm.ugtm_core.initBetaInvRandom(matD, n_nodes, n_individuals, n_dimensions)

Computes initial noise variance parameter for kernel GTM.

Parameters:
  • matD (array of shape (n_nodes, n_individuals)) – Matrix of squared Euclidean distances between manifold and data.
  • n_nodes (int) – The number of nodes defining a grid in the 2D space.
  • n_individuals (int) – The number of data instances.
  • n_dimensions (int) – Data space dimensionality (number of variables).
Returns:

Noise variance parameter for the data distribution (betaInv). Written as \(\beta^{-1}\) in the original paper.

Return type:

float

ugtm.ugtm_core.meanPoint(matR, matX)

Computes mean positions for data points (usual GTM output).

Parameters:
  • matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
  • matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
Returns:

Data representation in 2D space: mean positions (usual GTM output).

Return type:

array of shape (n_individuals, 2)

ugtm.ugtm_core.modePoint(matR, matX)

Computes modes (nodes with maximum responsibility for each data point).

Parameters:
  • matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
  • matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
Returns:

Data representation in 2D space: modes (nodes with max responsibility).

Return type:

array of shape (n_individuals, 2)
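
Both meanPoint and modePoint are simple reductions of the responsibility matrix over the node grid. A numpy sketch (not necessarily the package's implementation), assuming each column of matR sums to one:

import numpy as np

def mean_points(matR, matX):
    # responsibility-weighted average of node coordinates -> (n_individuals, 2)
    return matR.T @ matX

def mode_points(matR, matX):
    # coordinates of the highest-responsibility node per individual -> (n_individuals, 2)
    return matX[matR.argmax(axis=0)]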

ugtm.ugtm_core.optimBetaInv(matR, matD, n_dimensions)

Updates noise variance parameter.

Parameters:
  • matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
  • matD (array of shape (n_nodes, n_individuals)) – Matrix of squared Euclidean distances between manifold and data.
  • n_dimensions (int) – Data space dimensionality (number of variables).
Returns:

Updated noise variance parameter (\(\beta^{-1}\)).

Return type:

float

ugtm.ugtm_core.optimLMatrix(matR, matPhiMPlusOne, matG, betaInv, regul)

Updates the parameter matrix (matL) for kernel GTM.

Parameters:
  • matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
  • matG (array of shape (n_nodes, n_nodes)) – Diagonal matrix with elements \(G_{ii}=\sum_{n}^{n\_individuals} R_{in}\).
  • betaInv (float) – Noise variance parameter for the data distribution. Written as \(\beta^{-1}\) in the original paper.
  • regul (float) – Regularization coefficient.
Returns:

Updated parameter matrix (matL).

Return type:

array of shape (n_individuals, n_rbf_centers+1)

ugtm.ugtm_core.optimWMatrix(matR, matPhiMPlusOne, matG, data, betaInv, regul)

Updates parameter matrix W.

Parameters:
  • matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
  • matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
  • matG (array of shape (n_nodes, n_nodes)) – Diagonal matrix with elements \(G_{ii}=\sum_{n}^{n\_individuals} R_{in}\).
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • betaInv (float) – Noise variance parameter for the data distribution. Written as \(\beta^{-1}\) in the original paper.
  • regul (float) – Regularization coefficient.
Returns:

Updated parameter matrix W.

Return type:

array of shape (n_dimensions, n_rbf_centers+1)

ugtm.ugtm_crossvalidate module

Cross-validation support for GTC and GTR models (also SVM and PCA).

ugtm.ugtm_crossvalidate.crossvalidateGTC(data, labels, k=16, m=4, s=-1.0, regul=1.0, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, predict_mode='bayes', prior='estimated', n_folds=5, n_repetitions=10)

Cross-validate GTC model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = -1)) – RBF width factor. Default (-1) is to try different values. Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 1.0)) – Regularization coefficient. If set to -1, different values are tried. Impacts manifold flexibility.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes). At the moment, n_neighbors for GTC is always equal to 1.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • predict_mode ({“bayes”, “knn”}, optional) – Choose between nearest node algorithm (“knn”, output of predictNN()) or GTM Bayes classifier (“bayes”, output of predictBayes()). NB: the kNN algorithm is limited to only 1 nearest node at the moment (n_neighbors = 1).
  • prior ({“estimated”, “equiprobable”}, optional) – Type of prior used to build GTM class map (classMap()). Choose “estimated” to account for class imbalance.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
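
A hypothetical call on synthetic data (the reporting format of the cross-validation results is not documented here; this only illustrates the arguments):

import numpy as np
from ugtm.ugtm_crossvalidate import crossvalidateGTC

rng = np.random.default_rng(1234)
data = rng.normal(size=(80, 10))      # 80 individuals, 10 dimensions
labels = rng.integers(0, 2, size=80)  # binary class labels

crossvalidateGTC(data, labels, k=16, m=4, s=0.3, regul=1.0,
                 predict_mode="bayes", prior="estimated",
                 n_folds=5, n_repetitions=10)
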
ugtm.ugtm_crossvalidate.crossvalidateGTR(data, labels, k=16, m=4, s=-1, regul=-1, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)

Cross-validate GTR model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = -1)) – RBF width factor. Default (-1) is to try different values. Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = -1)) – Regularization coefficient. Default (-1) is to try different values. Impacts manifold flexibility.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
ugtm.ugtm_crossvalidate.crossvalidatePCAC(data, labels, n_neighbors=1, maxneighbours=11, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)

Cross-validate PCA kNN classification model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
  • maxneighbours (int, optional (default = 11)) – The function cross-validates kNN models with the number of neighbors ranging from n_neighbors to maxneighbours.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
ugtm.ugtm_crossvalidate.crossvalidatePCAR(data, labels, n_neighbors=1, maxneighbours=11, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)

Cross-validate PCA kNN regression model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
  • maxneighbours (int, optional (default = 11)) – The function cross-validates kNN models with the number of neighbors ranging from n_neighbors to maxneighbours.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
ugtm.ugtm_crossvalidate.crossvalidateSVC(data, labels, C=1.0, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)

Cross-validate SVC model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • C (float, optional (default = 1.0)) – SVM regularization parameter.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
ugtm.ugtm_crossvalidate.crossvalidateSVCrbf(data, labels, C=1, gamma=1, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)

Cross-validate SVC model with RBF kernel.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • C (float, optional (default = 1)) – SVM regularization parameter.
  • gamma (float, optional (default = 1)) – RBF parameter.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
ugtm.ugtm_crossvalidate.crossvalidateSVR(data, labels, C=-1, epsilon=-1, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)

Cross-validate SVR model with linear kernel.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_individuals, 1)) – Labels for train set.
  • C (float, optional (default = -1)) – SVM regularization parameter. If (C = -1), different values are tested.
  • epsilon (float, optional (default = -1)) – SVM tolerance parameter. If (epsilon = -1), different values are tested.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • n_folds (int, optional (default = 5)) – Number of CV folds.
  • n_repetitions (int, optional (default = 10)) – Number of CV iterations.
ugtm.ugtm_crossvalidate.whichExperiment(data, labels, args, discrete=False)

ugtm.ugtm_gtm module

Functions to run GTM models.

ugtm.ugtm_gtm.initialize(data, k, m, s, random_state=1234)

Initializes a GTM model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • k (int) – Sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int) – Sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • random_state (int, optional (default = 1234)) – Random state.
Returns:

Initial GTM model (not optimized).

Return type:

instance of InitialGTM

Notes

We use approximately the same notations as in the original GTM paper by C. Bishop et al. The initialization process is the following:

  1. GTM grid parameters: number of nodes = k*k, number of rbf centers = m*m
  2. Create node matrix X (matX, meshgrid of shape (k*k,2))
  3. Create rbf centers matrix M (matM, meshgrid of shape (m*m, 2))
  4. Initialize rbf width (rbfWidth, computeWidth())
  5. Create rbf matrix \(\Phi\) (matPhiMPlusOne, createPhiMatrix())
  6. Perform PCA on the data using sklearn’s PCA function
  7. Set U matrix to 2 first principal axes of data cov. matrix (matU)
  8. Initialize parameter matrix W using U and \(\Phi\) (matW, createWMatrix())
  9. Initialize manifold Y using W and \(\Phi\) (matY, createYMatrixInit())
  10. Set noise variance parameter (betaInv, evalBetaInv()) to the largest between: (1) the 3rd eigenvalue of the data covariance matrix (2) half the average distance between centers of Gaussian components.
  11. Store initial GTM model in InitialGTM object
ugtm.ugtm_gtm.optimize(data, initialModel, regul, niter, verbose=True)

Optimizes a GTM model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • initialModel (instance of InitialGTM) – PCA-initialized GTM model. The initial model is separated from the optimized model so that different data sets can be potentially used for initialization and optimization.
  • regul (float) – Regularization coefficient.
  • niter (int) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = True)) – Verbose mode (outputs loglikelihood values during EM algorithm).
Returns:

Optimized GTM model.

Return type:

instance of OptimizedGTM

Notes

We use approximately the same notations as in the original GTM paper by C. Bishop et al. The GTM optimization process is the following:

  1. Create distance matrix D between manifold and data matrix (matD, createDistanceMatrix())

  2. Until convergence (\(\Delta(\log \text{likelihood}) \leq 0.0001\)):

    1. Update data distribution P (matP, createPMatrix())
    2. Update responsibilities R (matR, createRMatrix())
    3. Update diagonal matrix G (matG, createGMatrix())
    4. Update parameter matrix W (matW, optimWMatrix())
    5. Update manifold matrix Y (matY, createYMatrix())
    6. Update distance matrix D (matD, createDistanceMatrix())
    7. Update noise variance parameter \(\beta^{-1}\) (betaInv, optimBetaInv())
    8. Estimate log likelihood and check for convergence (computelogLikelihood())
  3. Compute 2D GTM representation 1: means (matMeans, meanPoint())

  4. Compute 2D GTM representation 2: modes (matModes, modePoint())

  5. Store GTM model in OptimizedGTM object
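
runGTM() (below) wraps these two steps; calling initialize() and optimize() separately is useful when initialization and optimization should use different settings or data sets. A minimal sketch on synthetic data:

import numpy as np
from ugtm.ugtm_gtm import initialize, optimize

data = np.random.default_rng(1234).normal(size=(100, 10))

initial_model = initialize(data, k=16, m=4, s=0.3)         # InitialGTM (PCA-initialized)
gtm = optimize(data, initial_model, regul=0.1, niter=200)  # OptimizedGTM after EM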

ugtm.ugtm_gtm.projection(optimizedModel, new_data)

Project test set on optimized GTM model. No pre-processing involved.

Parameters:
  • optimizedModel (instance of OptimizedGTM) – Optimized GTM model, built using training set (train).
  • new_data (array of shape (n_test, n_dimensions)) – Test data matrix.
Returns:

Returns an instance of OptimizedGTM corresponding to the projected test set.

Return type:

instance of OptimizedGTM

Notes

The new_data must have gone through exactly the same preprocessing as the data used to obtain the optimized GTM model. For a function that performs the preprocessing as well as the projection on the map, cf. transform().
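
For example (a sketch: gtm is an optimized model, new_data is assumed to have received exactly the same preprocessing as the training data, and train_labels/test_ids are hypothetical label and identifier arrays):

from ugtm.ugtm_gtm import projection

projected = projection(gtm, new_data)  # OptimizedGTM for the projected test set

gtm.plot_html_projection(projections=projected,
                         labels=train_labels,   # training labels (background landscape)
                         ids=test_ids,          # tooltips for the projected points
                         output="gtm_projection")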

ugtm.ugtm_gtm.runGTM(data, k=16, m=4, s=0.3, regul=0.1, doPCA=False, n_components=-1, missing=True, missing_strategy='median', random_state=1234, niter=200, verbose=False)

Run GTM (wrapper for initialize + optimize).

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = True)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int (default = 1234)) – Random state.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
Returns:

Optimized GTM model.

Return type:

instance of OptimizedGTM

Notes

We use approximately the same notations as in the original GTM paper by C. Bishop et al. The initialization + optimization process is the following:

  1. GTM initialization:

    1. GTM grid parameters: number of nodes = k*k, number of rbf centers = m*m
    2. Create node matrix X (matX, meshgrid of shape (k*k,2))
    3. Create rbf centers matrix M (matM, meshgrid of shape (m*m, 2))
    4. Initialize rbf width (rbfWidth, computeWidth())
    5. Create rbf matrix \(\Phi\) (matPhiMPlusOne, createPhiMatrix())
    6. Perform PCA on the data using sklearn’s PCA function
    7. Set U matrix to 2 first principal axes of data cov. matrix (matU)
    8. Initialize parameter matrix W using U and \(\Phi\) (matW, createWMatrix())
    9. Initialize manifold Y using W and \(\Phi\) (matY, createYMatrixInit())
    10. Set noise variance parameter (betaInv, evalBetaInv()) to the largest between: (1) the 3rd eigenvalue of the data covariance matrix (2) half the average distance between centers of Gaussian components.
  2. GTM optimization:

    1. Create distance matrix D between manifold and data matrix (matD, createDistanceMatrix())

    2. Until convergence (\(\Delta(\log \text{likelihood}) \leq 0.0001\)):

      1. Update data distribution P (matP, createPMatrix())
      2. Update responsibilities R (matR, createRMatrix())
      3. Update diagonal matrix G (matG, createGMatrix())
      4. Update parameter matrix W (matW, optimWMatrix())
      5. Update manifold matrix Y (matY, createYMatrix())
      6. Update distance matrix D (matD, createDistanceMatrix())
      7. Update noise variance parameter \(\beta^{-1}\) (betaInv, optimBetaInv())
      8. Estimate log likelihood and check for convergence (computelogLikelihood())
    3. Compute 2D GTM representation 1: means (matMeans, meanPoint())

    4. Compute 2D GTM representation 2: modes (matModes, modePoint())

    5. Store GTM model in OptimizedGTM object

ugtm.ugtm_gtm.transform(optimizedModel, train, test, doPCA=False, n_components=-1, missing=True, missing_strategy='median', random_state=1234, process=True)

Preprocess and project test set on optimized GTM model.

Parameters:
  • optimizedModel (instance of OptimizedGTM) – Optimized GTM model, built using training set (train).
  • train (array of shape (n_train, n_dimensions)) – Training data matrix.
  • test (array of shape (n_test, n_dimensions)) – Test data matrix.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = True)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int (default = 1234)) – Random state.
  • process (bool (default = True)) – Apply preprocessing (missing, PCA) to train set and use values from train set to preprocess test set.
Returns:

Returns an instance of OptimizedGTM corresponding to the projected test set.

Return type:

instance of OptimizedGTM

ugtm.ugtm_kgtm module

Functions to initialize and optimize kernel GTM models.

ugtm.ugtm_kgtm.initializeKernel(data, k, m, s, maxdim, random_state=1234)

Initializes a kernel GTM (kGTM) model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • k (int) – Sqrt of the number of GTM nodes. Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int) – Sqrt of the number of RBF centers. Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float) – RBF width factor. Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • maxdim (int) – Max boundary for internal dimensionality estimation.
  • random_state (int, optional (default = 1234)) – Random state.
Returns:

Initial GTM model (not optimized).

Return type:

instance of InitialGTM

ugtm.ugtm_kgtm.optimizeKernel(data, initialModel, regul, niter, verbose=True)

Optimizes a kGTM model.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • initialModel (instance of InitialGTM) – Initial kGTM model. The initial model is separate from the optimized model so that different data sets can be potentially used for initialization and optimization.
  • regul (float) – Regularization coefficient.
  • niter (int) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = True)) – Verbose mode (outputs loglikelihood values during EM algorithm).
Returns:

Optimized kGTM model.

Return type:

instance of OptimizedGTM

ugtm.ugtm_kgtm.runkGTM(data, k=16, m=4, s=0.3, regul=0.1, maxdim=100, doPCA=False, doKernel=False, kernel='linear', n_components=-1, missing=True, missing_strategy='median', random_state=1234, niter=200, verbose=False)

Run kGTM algorithm (wrapper for initialize + optimize).

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
  • maxdim (int) – Max boundary for internal dimensionality estimation. Internal dimensionality is estimated as the number of principal components accounting for 99.5% of data variance. If this value is higher than maxdim, it is replaced by maxdim.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • doKernel (bool, optional (default = False)) – If doKernel is False, the data is assumed to be a kernel matrix already. If doKernel is True, a kernel will be computed from the data.
  • kernel (scikit-learn kernel, optional (default = ‘linear’)) – Kernel used to compute the kernel matrix from the data when doKernel is True.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = True)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int (default = 1234)) – Random state.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
Returns:

Optimized kGTM model.

Return type:

instance of OptimizedGTM
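
A minimal sketch on synthetic data, computing a linear kernel from the data via doKernel=True:

import numpy as np
from ugtm.ugtm_kgtm import runkGTM

data = np.random.default_rng(1234).normal(size=(100, 10))
kgtm = runkGTM(data, k=16, m=4, s=0.3, regul=0.1,
               doKernel=True, kernel="linear", maxdim=100)  # instance of OptimizedGTM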

ugtm.ugtm_landscape module

Builds GTM class maps (discrete labels) or landscapes (continuous activities).

class ugtm.ugtm_landscape.ClassMap(nodeClassP, nodeClassT, activityModel, uniqClasses)

Bases: object

Class for ClassMap: Bayesian classification model for each GTM node.

Parameters:
  • nodeClassT (array of shape (n_nodes, n_classes)) – Likelihood of each node \(k\) given class \(C_i\): \(P(k|C_i) = \frac{\sum_{i_{c}}R_{i_{c},k}}{N_c}\).
  • nodeClassP (array of shape (n_nodes, n_classes)) – Posterior probabilities of each class \(C_i\) for each node \(k\): \(P(C_i|k) =\frac{P(k|C_i)P(C_i)}{\sum_i P(k|C_i)P(C_i)}\)
  • activityModel (array of shape (n_nodes,1)) – Class label attributed to each GTM node on the GTM node grid. Computed using argmax of posterior probabilities.
  • uniqClasses (array of shape (n_classes,1)) – Unique class labels.
ugtm.ugtm_landscape.classMap(optimizedModel, activity, prior='estimated')

Computes GTM class map based on discrete activities (= discrete labels)

Parameters:
  • optimizedModel (an instance of OptimizedGTM) – The optimized GTM model.
  • activity (array of shape (n_individuals,1)) – Activity vector (discrete labels) associated with the data used to compute the optimized GTM model.
  • prior ({estimated, equiprobable}, optional) – Type of prior used for Bayesian classifier. “equiprobable” assigns the same weight to all classes: \(P(C_i)=1/N_{classes}\). “estimated” accounts for class imbalance using the number of individuals in each class \(N(C_i)\): \(P(C_i)=N_{C_i}/N_{total}\)
Returns:

Computes a GTM bayesian model and returns an instance of ClassMap.

Return type:

instance of ClassMap

Notes

This function computes the likelihood of each GTM node given a class, the posterior probabilities of each class (using Bayes’ theorem), and the class attributed to each node:

  1. output.nodeClassT: likelihood of each node \(k\) given class \(C_i\): \(P(k|C_i) = \frac{\sum_{i_{c}}R_{i_{c},k}}{N_c}\).
  2. output.nodeClassP: posterior probabilities of each class \(C_i\) for each node \(k\), using priors \(P(C_i)\): \(P(C_i|k) =\frac{P(k|C_i)P(C_i)}{\sum_i P(k|C_i)P(C_i)}\)
  3. output.activityModel: class label attributed to each GTM node on the GTM node grid, computed using the argmax of posterior probabilities.
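
For example, with an optimized model gtm and a vector of discrete labels (a sketch; shapes follow the ClassMap description above):

from ugtm.ugtm_landscape import classMap

cmap = classMap(gtm, labels, prior="estimated")

print(cmap.nodeClassP.shape)  # (n_nodes, n_classes): posterior P(C_i|k) for each node
print(cmap.activityModel)     # class label attributed to each node (argmax of posteriors)
print(cmap.uniqClasses)       # unique class labels
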
ugtm.ugtm_landscape.landscape(optimizedModel, activity)

Computes GTM landscapes based on activities (= continuous labels).

Parameters:
  • optimizedModel (an instance of OptimizedGTM) – The optimized GTM model.
  • activity (array of shape (n_individuals,1)) – Activity vector (continuous labels) associated with the data used to compute the optimized GTM model.
Returns:

Activity landscape: associates each GTM node \(k\) on the GTM node grid with an activity value, computed as a responsibility-weighted mean of data activity values (continuous labels). If a = activities, r_k = vector of optimized GTM responsibilities for node k, and N = n_individuals: \(landscape_k = \frac{\mathbf{a \cdot r}_k}{\sum_i^{N}r_{ik}}\)

Return type:

array of shape (n_nodes,1)
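
A numpy transcription of the documented formula (a sketch, not necessarily the package's implementation), with matR of shape (n_individuals, n_nodes) as stored on the OptimizedGTM object and activity of shape (n_individuals,):

import numpy as np

def landscape_sketch(matR, activity):
    # landscape_k = (a . r_k) / sum_i r_ik for each node k -> shape (n_nodes,)
    return (matR.T @ activity) / matR.sum(axis=0)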

ugtm.ugtm_plot module

ugtm plot functions.

class ugtm.ugtm_plot.NumpyEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)

Bases: json.encoder.JSONEncoder

default(obj)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
ugtm.ugtm_plot.plot(coordinates, labels=None, title='', output='output', discrete=False, pointsize=1.0, alpha=0.3, cname='Spectral_r', output_format='pdf')

Simple plotting function.

Parameters:
  • coordinates (array of shape (n_individuals, 2)) – Coordinates to plot.
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • title (str, optional (default = ‘’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map. Cf. https://matplotlib.org/examples/color/colormaps_reference.html
  • output_format ({‘pdf’, ‘png’, ‘ps’, ‘eps’, ‘svg’}) – Output format.
Returns:

Plot of the given coordinates.

Return type:

Image file

ugtm.ugtm_plot.plotClassMap(optimizedModel, labels, prior='estimated', do_interpolate=True, cname='Spectral_r', pointsize=1.0, alpha=0.3)

Plots GTM class map. Internal usage.

ugtm.ugtm_plot.plotClassMapNoPoints(optimizedModel, labels, prior='estimated', do_interpolate=True, cname='Spectral_r')

Plots GTM class map without 2D representations. Internal usage.

ugtm.ugtm_plot.plotLandscape(optimizedModel, labels, do_interpolate=True, cname='Spectral_r', pointsize=1.0, alpha=0.3)

Plots GTM landscape. Internal usage.

ugtm.ugtm_plot.plotLandscapeNoPoints(optimizedModel, labels, do_interpolate=True, cname='Spectral_r')

Plots GTM landscape without 2D representations. Internal usage.

ugtm.ugtm_plot.plotMultiPanelGTM(optimizedModel, labels, output='output', discrete=False, pointsize=1.0, alpha=0.3, do_interpolate=True, cname='Spectral_r', prior='estimated')

Multipanel visualization for GTM object - PDF output.

Parameters:
  • optimizedModel (instance of OptimizedGTM) – Optimized GTM model.
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • do_interpolate (bool, optional (default = True)) – Interpolate color between grid nodes.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior used to compute class probabilities on the map: ‘equiprobable’ assigns equal priors to all classes, while ‘estimated’ priors take class imbalance into account.
Returns:

Four plots are returned: (1) means, (2) modes, (3) landscape with means, (4) landscape with modes.

Return type:

PDF image

ugtm.ugtm_plot.plot_html(coordinates, labels=None, ids=None, title='plot', discrete=False, output='output', pointsize=1.0, alpha=0.3, cname='Spectral_r')

Simple plotting function: HTML output.

Parameters:
  • coordinates (array of shape (n_individuals, 2)) – Coordinates to plot.
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • title (str, optional (default = ‘plot’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map. Cf. https://matplotlib.org/examples/color/colormaps_reference.html
Returns:

Plots coordinates to HTML file.

Return type:

HTML file

ugtm.ugtm_plot.plot_html_GTM(optimizedModel, labels=None, ids=None, plot_arrows=True, title='GTM', discrete=False, output='output', pointsize=1.0, alpha=0.3, do_interpolate=True, cname='Spectral_r', prior='estimated')

Plotting function for GTM object - HTML output.

Parameters:
  • optimizedModel (instance of ugtm.ugtm_classes.OptimizedGTM) – OptimizedGTM model.
  • labels (array of shape (n_individuals,), optional (default = None)) – Data labels.
  • ids (array of shape (n_individuals,), optional (default = None)) – Identifiers for each data point - appears in tooltips.
  • title (str, optional (default = ‘GTM’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • do_interpolate (bool, optional (default = True)) – Interpolate color between grid nodes.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior used to compute class probabilities on the map: ‘equiprobable’ assigns equal priors to all classes, while ‘estimated’ priors take class imbalance into account.
Returns:

HTML file of GTM output (mean coordinates). If labels are provided, a landscape (continuous or discrete depending on the labels) is drawn in the background. This landscape is computed using responsibilities and is indicative of the average activity or label value at a given node on the map.

Return type:

HTML file

Notes

May be time-consuming for large datasets.

ugtm.ugtm_plot.plot_html_GTM_projection(optimizedModel, projections, labels=None, ids=None, plot_arrows=True, title='GTM_projection', discrete=False, output='output', pointsize=1, alpha=0.3, do_interpolate=True, cname='Spectral_r', prior='estimated')

Returns a GTM landscape with projected data points - HTML output.

Parameters:
  • optimizedModel (instance of ugtm.ugtm_classes.OptimizedGTM) – Optimized GTM model built on the training set.
  • projections (instance of ugtm.ugtm_classes.OptimizedGTM) – Optimized GTM model of the projected (test) data, as returned by projection() or transform().
  • labels (array of shape (n_train,), optional (default = None)) – Data labels for the training set (not projections).
  • ids (array of shape (n_test,), optional (default = None)) – Identifiers for each projected data point - appears in tooltips.
  • title (str, optional (default = ‘GTM_projection’)) – Plot title.
  • output (str, optional (default = ‘output’)) – Output path for plot.
  • discrete (bool (default = False)) – Type of label; discrete=True if labels are nominal or binary.
  • pointsize (float, optional (default = ‘1’)) – Marker size.
  • alpha (float, optional (default = ‘0.3’)) – Marker transparency.
  • do_interpolate (bool, optional (default = True)) – Interpolate color between grid nodes.
  • cname (str, optional (default = ‘Spectral_r’)) – Name of matplotlib color map.
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior used to compute class probabilities on the map: ‘equiprobable’ assigns equal priors to all classes, while ‘estimated’ priors take class imbalance into account.
Returns:

HTML file of GTM output (mean coordinates). This function plots a GTM model (from a training set) with projected points (= mean positions of projected test data). If labels are provided, a landscape (continuous or discrete depending on the labels) is drawn in the background. This landscape is computed using responsibilities and is indicative of the average activity or label value at a given node on the map.

Return type:

HTML file

Notes

  • May be time-consuming for large datasets.
  • The labels correspond to training data (the optimized model).
  • The ids (identifiers) correspond to the test data (projections).
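
A hedged sketch of the projection plot, assuming both the trained model and the test projections are obtained as OptimizedGTM instances through the eGTM wrapper (model=’complete’); data, labels and identifiers are synthetic.

  import numpy as np
  from ugtm.ugtm_sklearn import eGTM
  from ugtm.ugtm_plot import plot_html_GTM_projection

  # Synthetic train/test split; labels describe the training set, ids the test set.
  X_train, X_test = np.random.randn(80, 20), np.random.randn(20, 20)
  y_train = np.random.choice(["active", "inactive"], size=80)
  test_ids = ["mol%d" % i for i in range(20)]

  # model='complete' keeps full OptimizedGTM objects for both calls.
  emap = eGTM(k=16, m=4, s=0.3, regul=0.1, model="complete")
  trained = emap.fit_transform(X_train)   # training-set model
  projected = emap.transform(X_test)      # test-set projections

  plot_html_GTM_projection(optimizedModel=trained, projections=projected,
                           labels=y_train, ids=test_ids, discrete=True,
                           output="gtm_projection")
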
ugtm.ugtm_plot.plot_pdf(*args, **kwargs)

ugtm.ugtm_predictions module

GTC (GTM classification) and GTR (GTM regression)

ugtm.ugtm_predictions.GTC(train, labels, test, k=16, m=4, s=0.3, regul=0.1, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, predict_mode='bayes', prior='estimated')

Run GTC (GTM classification): Bayes or nearest node algorithm.

Parameters:
  • train (array of shape (n_train, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_train, 1)) – Labels for train set.
  • test (array of shape (n_test, n_dimensions)) – Test set data matrix.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k), which is generally a good rule of thumb. m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient. Impacts manifold flexibility.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes). At the moment, n_neighbors is always equal to 1.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • predict_mode ({“bayes”, “knn”}, optional) – Choose between nearest node algorithm (“knn”, output of predictNN()) or GTM Bayes classifier (“bayes”, output of predictBayes()). NB: the kNN algorithm is limited to only 1 nearest node at the moment (n_neighbors = 1).
  • prior ({“estimated”, “equiprobable”}, optional) – Type of prior used to build GTM class map (classMap()). Choose “estimated” to account for class imbalance.
Returns:

Predicted class for test set individuals.

Return type:

array of shape (n_test, 1)

Notes

The GTM nearest node classifier (predict_mode = “knn”, predictNN()):

  1. A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has a class label
  2. The test set is projected onto the GTM map
  3. A 2D GTM representation is chosen for the test set (representation = modes or means)
  4. Nearest node on the GTM map is found for each test set individual
  5. The predicted label for each individual is the label of its nearest node on the GTM map

The GTM Bayes classifier (predict_mode = “bayes”, predictBayes()):

  1. A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has posterior class probabilities
  2. The test set is projected onto the GTM map
  3. The GTM representation for each individual is its responsibility vector (posterior probability distribution on the map)
  4. The probabilities of belonging to each class for a specific individual are computed as an average of posterior class probabilities (array of shape (n_nodes, n_classes)), weighted by the individual’s responsibilities on the GTM map (array of shape (1, n_nodes))
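
A minimal GTC sketch on synthetic data (array shapes and hyperparameter values are illustrative only):

  import numpy as np
  from ugtm.ugtm_predictions import GTC

  # Synthetic binary classification problem: 90 training and 10 test instances.
  train = np.random.randn(90, 30)
  labels = np.random.choice([0, 1], size=90)
  test = np.random.randn(10, 30)

  # Bayes classifier by default; predict_mode="knn" selects the nearest node variant.
  predicted = GTC(train=train, labels=labels, test=test,
                  k=16, m=4, s=0.3, regul=0.1,
                  predict_mode="bayes", prior="estimated")
  print(predicted)   # one predicted class per test instance
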
ugtm.ugtm_predictions.GTR(train, labels, test, k=16, m=4, s=0.3, regul=0.1, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234)

Run GTR (GTM nearest node(s) regression).

Parameters:
  • train (array of shape (n_train, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_train, 1)) – Labels for train set.
  • test (array of shape (n_test, n_dimensions)) – Test set data matrix.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k), which is generally a good rule of thumb. m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient. Impacts manifold flexibility.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
Returns:

Predicted activity values for test set individuals.

Return type:

array of shape (n_test, 1)

Notes

The GTM nearest node(s) regression (predictNN()):

  1. A GTM landscape (GTM colored by activity) is built using the training set (landscape()); the landscape is discretized into nodes, and each node has an estimated activity value
  2. The test set is projected onto the GTM map
  3. A 2D GTM representation is chosen for the test set (representation = modes or means)
  4. Nearest node(s) on the GTM map is found for each test set individual
  5. The predicted activity for each individual is a weighted average of nearest node activities.
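
A minimal GTR sketch on synthetic data (array shapes and hyperparameter values are illustrative only):

  import numpy as np
  from ugtm.ugtm_predictions import GTR

  # Synthetic regression problem with continuous training activities.
  train = np.random.randn(90, 30)
  activities = np.random.randn(90)
  test = np.random.randn(10, 30)

  # n_neighbors controls how many nearest nodes are averaged for each prediction.
  predicted = GTR(train=train, labels=activities, test=test,
                  k=16, m=4, s=0.3, regul=0.1,
                  n_neighbors=3, representation="means")
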
ugtm.ugtm_predictions.advancedGTC(train, labels, test, n_neighbors=1, representation='modes', niter=200, k=16, m=4, regul=0.1, s=0.3, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, predict_mode='bayes', prior='estimated')

Run GTC (GTM classification): advanced Bayes

Parameters:
  • train (array of shape (n_train, n_dimensions)) – Train set data matrix.
  • labels (array of shape (n_train, 1)) – Labels for train set.
  • test (array of shape (n_test, n_dimensions)) – Test set data matrix.
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k), which is generally a good rule of thumb. m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient. Impacts manifold flexibility.
  • n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes). At the moment, n_neighbors is always equal to 1.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int, optional (default = 1234)) – Random state.
  • predict_mode ({“bayes”}, optional) – At the moment, only the GTM Bayes classifier is available; (“bayes”, output of advancedPredictBayes()).
  • prior ({“estimated”, “equiprobable”}, optional) – Type of prior used to build GTM class map (classMap()). Choose “estimated” to account for class imbalance.
Returns:

The output is a dictionary defined as follows:

  1. output[“optimizedModel”]: original training set GTM model, instance of OptimizedGTM
  2. output[“indiv_projections”]: test set GTM model, instance of OptimizedGTM
  3. output[“indiv_probabilities”]: class probabilities for each individual (= dot product between test responsibility matrix and posterior class probabilities)
  4. output[“indiv_predictions”]: class prediction for each individual (argmax of output[“indiv_probabilities”])
  5. output[“group_projections”]: average responsibility vector for the entire test set
  6. output[“group_probabilities”]: posterior class probabilities for the entire test set (dot product between output[“group_projections”] and posterior class probabilities)
  7. output[“uniqClasses”]: classes

Return type:

a dict

Notes

The GTM nearest node classifier (predict_mode = “knn”, predictNN()):

  1. A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has a class label
  2. The test set is projected onto the GTM map
  3. A 2D GTM representation is chosen for the test set (representation = modes or means)
  4. Nearest node on the GTM map is found for each test set individual
  5. The predicted label for each individual is the label of its nearest node on the GTM map

The GTM Bayes classifier (predict_mode = “bayes”, predictBayes()):

  1. A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has posterior class probabilities
  2. The test set is projected onto the GTM map
  3. The GTM representation for each individual is its responsibility vector (posterior probability distribution on the map)
  4. The probabilities of belonging to each class for a specific individual are computed as an average of posterior class probabilities (array of shape (n_nodes, n_classes)), weighted by the individual’s responsibilities on the GTM map (array of shape (1, n_nodes))
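
A sketch of the advanced classifier on synthetic data, showing how the returned dictionary (keys as documented above) can be inspected:

  import numpy as np
  from ugtm.ugtm_predictions import advancedGTC

  train = np.random.randn(90, 30)
  labels = np.random.choice(["active", "inactive"], size=90)
  test = np.random.randn(10, 30)

  out = advancedGTC(train=train, labels=labels, test=test,
                    k=16, m=4, s=0.3, regul=0.1, prior="estimated")

  # Per-individual and group-level results.
  print(out["uniqClasses"])          # classes used to build the model
  print(out["indiv_predictions"])    # one predicted class per test instance
  print(out["indiv_probabilities"])  # class probabilities per test instance
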
ugtm.ugtm_predictions.advancedPredictBayes(optimizedModel, labels, new_data, prior='estimated')

Bayesian GTM classifier: complete model

Parameters:
  • optimizedModel (instance of OptimizedGTM) – Optimized GTM model built using a training set of shape (n_individuals, n_dimensions)
  • labels (array of shape (n_individuals, 1)) – Labels (discrete or continuous) associated with training set
  • new_data (array of shape (n_test, n_dimensions)) – New data matrix (test set).
  • prior ({‘estimated’, ‘equiprobable’}, optional) – Only used for classification. Sets priors (Bayes’ theorem) in classMap().
Returns:

The output is a dictionary defined as follows:

  1. output[“optimizedModel”]: original training set GTM model, instance of OptimizedGTM
  2. output[“indiv_projections”]: test set GTM model, instance of OptimizedGTM
  3. output[“indiv_probabilities”]: class probabilities for each individual (= dot product between test responsibility matrix and posterior class probabilities)
  4. output[“indiv_predictions”]: class prediction for each individual (argmax of output[“indiv_probabilities”])
  5. output[“group_projections”]: average responsibility vector for the entire test set
  6. output[“group_probabilities”]: posterior class probabilities for the entire test set (dot product between output[“group_projections”] and posterior class probabilities)
  7. output[“uniqClasses”]: classes

Return type:

a dict

Notes

This function computes GTM class predictions by using posterior probabilities of classes weighted by responsibilities.

  1. generate GTM class map (classMap());
  2. Project new data (projection()) on optimized GTM model (OptimizedGTM)
  3. Projected data responsibilities R are used as weights to find outcome \(C_{max}\) for each tested instance: \(C_{max} = \operatorname*{arg\,max}_C \sum_k{R_{ki} P(C|k)}\)

The algorithm is the same as in predictBayes(), but this function returns a complete output including the original optimized GTM model (training set), the test set GTM projections, class probabilities and class predictions for each individual, group projections (average position of the whole test set on the map), class probabilities for the whole test set, and the classes used to build the classification model.

ugtm.ugtm_predictions.predictBayes(optimizedModel, labels, new_data, prior='estimated')

Bayesian GTM classifier (GTC Bayes).

Parameters:
  • optimizedModel (instance of OptimizedGTM) – Optimized GTM model built using a training set of shape (n_individuals, n_dimensions)
  • labels (array of shape (n_individuals, 1)) – Labels (discrete or continuous) associated with training set
  • new_data (array of shape (n_test, n_dimensions)) – New data matrix (test set).
  • prior ({‘estimated’, ‘equiprobable’}, optional) – Only used for classification. Sets priors (Bayes’ theorem) in classMap().
Returns:

Predicted outcome.

Return type:

array of shape (n_test, 1)

Notes

This function computes GTM class predictions by using posterior probabilities of classes weighted by responsibilities, similar to a maximum a posteriori (MAP) estimator.

  1. generate GTM class map (classMap());
  2. Project new data (projection()) on optimized GTM model (OptimizedGTM)
  3. Projected data responsibilities R are used as weights to find outcome \(C_{max}\) for each tested instance: \(C_{max} = \operatorname*{arg\,max}_C \sum_k{R_{ki} P(C|k)}\)
ugtm.ugtm_predictions.predictNN(optimizedModel, labels, new_data, modeltype='regression', n_neighbors=1, representation='modes', prior='estimated')

GTM nearest node(s) classification or regression.

Parameters:
  • optimizedModel (instance of OptimizedGTM) – Optimized GTM model built using a training set of shape (n_individuals, n_dimensions)
  • labels (array of shape (n_individuals, 1)) – Labels (discrete or continuous) associated with training set
  • new_data (array of shape (n_test, n_dimensions)) – New data matrix (test set).
  • modeltype ({‘classification’, ‘regression’}, optional) – Choice between classification and regression.
  • n_neighbors (int, optional (default = 1)) – Number of nodes to take into account in kNN algorithm. NB: for classification, n_neighbors is always equal to 1.
  • representation ({‘modes’, ‘means’}, optional) – Defines GTM representation type: mean or mode of responsibilities.
  • prior ({‘estimated’, ‘equiprobable’}, optional) – Only used for classification. Sets priors (Bayes’ theorem) in classMap().
Returns:

Predicted outcome.

Return type:

array of shape (n_test, 1)

Notes

This function implements classification or regression based on nearest GTM nodes:

  1. If (modeltype == ‘classification’), generate GTM class map (classMap()); if (modeltype == ‘regression’), generate GTM landscape (landscape())
  2. Project new data (projection()) on optimized GTM model (OptimizedGTM)
  3. Depending on provided parameters, choose means or modes as GTM coordinates for the new data
  4. Find the nodes closest to the new data GTM coordinates (sklearn function kneighbors)
  5. Retrieve predicted outcomes corresponding to nodes on class map (classification task) or landscape (regression task)
  6. If (modeltype == ‘classification’), the predicted outcome is the outcome of the nearest node on the class map; if (modeltype == ‘regression’), the predicted outcome is the average outcome of the k nearest nodes (k = n_neighbors), weighted by inverse squared distances (weights=1/((dist)**2))
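
A regression sketch for predictNN(), assuming the OptimizedGTM is built with the eGTM wrapper (model=’complete’) on the same, already preprocessed, data space as new_data; all values are synthetic.

  import numpy as np
  from ugtm.ugtm_sklearn import eGTM
  from ugtm.ugtm_predictions import predictNN

  train = np.random.randn(90, 30)
  activities = np.random.randn(90)   # continuous outcome => regression
  test = np.random.randn(10, 30)

  # Full OptimizedGTM instance for the training set.
  gtm = eGTM(k=16, m=4, s=0.3, regul=0.1, model="complete").fit_transform(train)

  # Weighted average of the 3 nearest node activities for each test instance.
  predicted = predictNN(optimizedModel=gtm, labels=activities, new_data=test,
                        modeltype="regression", n_neighbors=3,
                        representation="means")
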
ugtm.ugtm_predictions.predictNNSimple(train, test, labels, n_neighbors=1, modeltype='regression')

Nearest neighbor(s) classification or regression.

Parameters:
  • train (array of shape (n_train, n_dimensions)) – Train set data matrix.
  • test (array of shape (n_test, n_dimensions)) – Test set data matrix.
  • labels (array of shape (n_train, 1)) – Labels (discrete or continuous) for the training set.
  • n_neighbors (int, optional (default = 1)) – Number of nodes to take into account in kNN algorithm.
  • modeltype ({‘classification’, ‘regression’}, optional) – Choice between classification and regression.
Returns:

Predicted outcome.

Return type:

array of shape (n_test, 1)

Notes

This function implements classification or regression based on the classical kNN algorithm.

ugtm.ugtm_predictions.printClassPredictions(prediction, output)

Print output of advancedPredictBayes().

Parameters:
  • prediction (dict) – Output of advancedPredictBayes(), with the following keys: “optimizedModel”: OptimizedGTM, “indiv_projections”: OptimizedGTM, “indiv_probabilities”: array of shape (n_individuals, n_classes), “indiv_predictions”: array of shape (n_individuals, 1), “group_projections”: array of shape (n_nodes, 1), “group_probabilities”: array of shape (n_classes, 1), “uniqClasses”: array of shape (n_classes, 1)
  • output (str) – Output path to write class prediction model (prediction dictionary).
Returns:

  1. output_indiv_probabilities.csv
  2. output_indiv_predictions.csv
  3. output_group_probabilities.csv

Return type:

CSV files
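
A sketch chaining advancedGTC() (whose output has the same dictionary structure as advancedPredictBayes()) with printClassPredictions(); data are synthetic and file names follow the documented output prefix.

  import numpy as np
  from ugtm.ugtm_predictions import advancedGTC, printClassPredictions

  train = np.random.randn(90, 30)
  labels = np.random.choice(["active", "inactive"], size=90)
  test = np.random.randn(10, 30)

  # advancedGTC() and advancedPredictBayes() return the same prediction dictionary.
  prediction = advancedGTC(train=train, labels=labels, test=test)

  # Writes CSV files prefixed with "output", e.g. output_indiv_predictions.csv.
  printClassPredictions(prediction, output="output")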

ugtm.ugtm_preprocess module

Preprocessing operations (mostly using scikit-learn functions).

class ugtm.ugtm_preprocess.ProcessedTrainTest(train, test)

Bases: object

Class for processed train and test set.

Parameters:
  • train (array of shape (n_train, n_dimensions)) – Train data matrix.
  • test (array of shape (n_test, n_dimensions)) – Test data matrix.
ugtm.ugtm_preprocess.chooseKernel(data, kerneltype='euclidean')

Kernelize data (uses scikit-learn).

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • kerneltype ({‘euclidean’, ‘cosine’, ‘laplacian’, ‘polynomial_kernel’, ‘jaccard’}, optional) – Kernel type.
Returns:

Return type:

array of shape (n_individuals, n_individuals)
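
A short usage sketch on synthetic data:

  import numpy as np
  from ugtm.ugtm_preprocess import chooseKernel

  data = np.random.randn(50, 200)             # 50 individuals, 200 descriptors

  # Square kernel matrix of shape (n_individuals, n_individuals).
  K = chooseKernel(data, kerneltype="laplacian")
  print(K.shape)                              # (50, 50)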

ugtm.ugtm_preprocess.pcaPreprocess(data, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234)

Preprocess data using PCA.

Parameters:
  • data (array of shape (n_individuals, n_dimensions)) – Data matrix.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int (default = 1234)) – Random state.
Returns:

Data projected onto principal axes.

Return type:

array of shape (n_individuals, n_components)
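
For example, on synthetic data (values illustrative only):

  import numpy as np
  from ugtm.ugtm_preprocess import pcaPreprocess

  data = np.random.randn(100, 50)

  # n_components=-1 keeps the components accounting for 80% of the variance.
  reduced = pcaPreprocess(data, doPCA=True, n_components=-1,
                          missing=False, random_state=1234)
  print(reduced.shape)                        # (100, n_components)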

ugtm.ugtm_preprocess.processTrainTest(train, test, doPCA, n_components, missing=False, missing_strategy='median', random_state=1234)

Preprocess train and test data using PCA.

Parameters:
  • train (array of shape (n_train, n_dimensions)) – Train data matrix.
  • test (array of shape (n_test, n_dimensions)) – Test data matrix.
  • doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
  • n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
  • missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
  • missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
  • random_state (int (default = 1234)) – Random state.
Returns:

Return type:

instance of ProcessedTrainTest
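
A sketch on synthetic data, assuming the processed matrices are exposed as the train and test attributes of the returned ProcessedTrainTest instance (attribute names inferred from its constructor):

  import numpy as np
  from ugtm.ugtm_preprocess import processTrainTest

  train = np.random.randn(80, 50)
  test = np.random.randn(20, 50)

  processed = processTrainTest(train, test, doPCA=True, n_components=10)

  # PCA-reduced train and test matrices (attribute names assumed, see above).
  print(processed.train.shape, processed.test.shape)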

ugtm.ugtm_sklearn module

GTM transformer, classifier and regressor compatible with sklearn

class ugtm.ugtm_sklearn.eGTC(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, prior='estimated')

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

eGTC : GTC Bayesian classifier for sklearn pipelines.

Parameters:
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
  • random_state (int (default = 1234)) – Random state.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior for class map. Use ‘estimated’ to account for class imbalance.
fit(X, y)

Constructs activity model f(X,y) using classMap().

Parameters:
  • X (array of shape (n_instances, n_dimensions)) – Data matrix.
  • y (array of shape (n_instances,)) – Data labels.
predict(X)

Predicts new labels for X using projection().

Parameters: X (array of shape (n_instances, n_dimensions)) – Data matrix.
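
Since eGTC follows the fit/predict convention, it can be dropped into a standard scikit-learn pipeline; a minimal sketch on synthetic data:

  import numpy as np
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from ugtm.ugtm_sklearn import eGTC

  X = np.random.randn(100, 20)
  y = np.random.choice([0, 1], size=100)

  # Scale the descriptors, then classify with the GTM Bayes classifier.
  clf = Pipeline([("scale", StandardScaler()),
                  ("gtc", eGTC(k=16, m=4, s=0.3, regul=0.1, prior="estimated"))])
  clf.fit(X, y)
  predicted = clf.predict(X)
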
class ugtm.ugtm_sklearn.eGTCnn(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, prior='estimated', representation='modes')

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

eGTCnn: GTC nearest node classifier for sklearn pipelines.

Parameters:
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
  • random_state (int (default = 1234)) – Random state.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
  • prior ({‘estimated’, ‘equiprobable’}) – Type of prior for class map. Use ‘estimated’ to account for class imbalance.
  • representation ({‘modes’, ‘means’}, optional) – Type of 2D representation used in kNN algorithm.
fit(X, y)

Constructs activity model f(X,y) using classMap().

Parameters:
  • X (array of shape (n_instances, n_dimensions)) – Data matrix.
  • y (array of shape (n_instances,)) – Data labels.
predict(X)

Predicts new labels for X using projection().

Parameters: X (array of shape (n_instances, n_dimensions)) – Data matrix.
class ugtm.ugtm_sklearn.eGTM(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, model='means')

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

eGTM: GTM Transformer for sklearn pipeline.

Parameters:
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
  • random_state (int (default = 1234)) – Random state.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
  • model ({‘means’, ‘modes’, ‘responsibilities’,’complete’}, optional) – GTM data representations: ‘means’ for mean data positions, ‘modes’ for positions with max. responsibilities, ‘responsibilities’ for probability distribution on the map, ‘complete’ for a complete instance of OptimizedGTM
fit(X, y=None)

Fits GTM to X using OptimizedGTM.

Parameters: X (2D array) – Data matrix.
fit_transform(X, y=None)

Fits and transforms X using GTM.

Parameters: X (2D array) – Data matrix.
Returns:
  • if self.model=”means”, array of shape (n_instances, 2),
  • if self.model=”modes”, array of shape (n_instances, 2),
  • if self.model=”responsibilities”, array of shape (n_instances, n_nodes),
  • if self.model=”complete”, instance of class OptimizedGTM
inverse_transform(matR)

Inverse transformation of responsibilities back into the original data space.

Parameters: matR (array of shape (n_samples, n_nodes))
Returns: matY
Return type: array of shape (n_samples, n_dimensions)
transform(X)

Projects new data X onto GTM using projection().

Parameters: X (2D array) – Data matrix.
Returns:
  • if self.model=”means”, array of shape (n_instances, 2),
  • if self.model=”modes”, array of shape (n_instances, 2),
  • if self.model=”responsibilities”, array of shape (n_instances, n_nodes),
  • if self.model=”complete”, instance of class OptimizedGTM
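
A transformer sketch on synthetic data, illustrating two of the output types and inverse_transform():

  import numpy as np
  from ugtm.ugtm_sklearn import eGTM

  X_train = np.random.randn(80, 20)
  X_test = np.random.randn(20, 20)

  # 2D mean positions for the training set.
  means = eGTM(model="means").fit_transform(X_train)   # shape (80, 2)

  # Responsibilities for new data, mapped back into the data space.
  egtm = eGTM(model="responsibilities")
  egtm.fit(X_train)
  R = egtm.transform(X_test)                 # shape (20, n_nodes)
  reconstructed = egtm.inverse_transform(R)  # shape (20, 20)
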
class ugtm.ugtm_sklearn.eGTR(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, n_neighbors=2, representation='modes')

Bases: sklearn.base.BaseEstimator, sklearn.base.RegressorMixin

eGTR: GTM nearest node(s) regressor for sklearn pipelines.

Parameters:
  • k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
  • m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
  • s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
  • regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
  • random_state (int (default = 1234)) – Random state.
  • niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
  • verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
  • n_neighbors (int, optional (default = 2)) – Number of neighbors for kNN algorithm.
  • representation ({‘modes’, ‘means’}, optional) – Type of 2D representation used in kNN algorithm.
fit(X, y)

Constructs activity model f(X,y) using landscape().

Parameters:
  • X (array of shape (n_instances, n_dimensions)) – Data matrix.
  • y (array of shape (n_instances,)) – Data labels.
predict(X)

Predicts new labels for X using projection().

Parameters: X (array of shape (n_instances, n_dimensions)) – Data matrix.
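
Because eGTR derives from sklearn.base.BaseEstimator and sklearn.base.RegressorMixin, standard scikit-learn model selection should apply; a hedged sketch on synthetic data (assuming the constructor arguments are exposed as estimator parameters in the usual BaseEstimator way):

  import numpy as np
  from sklearn.model_selection import GridSearchCV
  from ugtm.ugtm_sklearn import eGTR

  X = np.random.randn(100, 20)
  y = np.random.randn(100)

  # Tune a few GTM hyperparameters by cross-validated R2.
  search = GridSearchCV(eGTR(niter=200),
                        param_grid={"s": [0.1, 0.3], "regul": [0.01, 0.1],
                                    "n_neighbors": [1, 2, 3]},
                        cv=3, scoring="r2")
  search.fit(X, y)
  print(search.best_params_)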

Module contents

ugtm: a Python package for Generative Topographic Mapping (GTM)

Submodules

  • ugtm_sklearn – GTM transformer, classifier and regressor compatible with sklearn.
  • ugtm_gtm – Functions to run GTM models.
  • ugtm_kgtm – Functions to initialize and optimize kernel GTM models.
  • ugtm_classes – Defines classes for initial and optimized GTM model.
  • ugtm_plot – ugtm plot functions.
  • ugtm_landscape – Builds continuous GTM class maps or landscapes using labels or activities.
  • ugtm_predictions – GTC (GTM classification) and GTR (GTM regression).
  • ugtm_crossvalidate – Cross-validation support for GTC and GTR models (also SVM and PCA).
  • ugtm_preprocess – Preprocessing operations (mostly using scikit-learn functions).