ugtm package

Submodules

ugtm.ugtm_classes module

Defines classes for initial and optimized GTM model.

class ugtm.ugtm_classes.InitialGTM(matX, matM, n_nodes, n_rbf_centers, rbfWidth, matPhiMPlusOne, matW, matY, betaInv, n_dimensions)[source]

Bases: object

Class for initial GTM model.

Parameters:

matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
matM (array of shape (n_rbf_centers, 2)) – Coordinates of radial basis function (RBF) centers, defining a grid in the 2D space.
n_nodes (int) – The number of nodes defining a grid in the 2D space.
n_rbf_centers (int) – The number of radial basis function (RBF) centers.
rbfWidth (float) – Initial radial basis function (RBF) width. This is set to the average of the minimum distances between RBF centers, scaled by the width factor: $rbfWidth=\sigma \times average(\mathbf{distances(rbf)}_{min})$, where $\sigma$ is the GTM hyperparameter s. NB: if GTM hyperparameter s = 0 (not recommended), rbfWidth is set to the maximum distance between RBF centers.
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space); A point matY[:,i] is a center of Gaussian component in data space. $\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T$
betaInv (float) – Noise variance parameter for the data distribution. Written as $\beta^{-1}$ in the original paper. Initialized to the larger of: (1) the 3rd eigenvalue of the data covariance matrix, and (2) half the average distance between Gaussian component centers in the data space (matY matrix).
n_dimensions (int) – Data space dimensionality (number of variables).

class ugtm.ugtm_classes.OptimizedGTM(matW, matY, matP, matR, betaInv, matMeans, matModes, matX, n_dimensions, converged)[source]

Bases: object

Class for optimized GTM model.

Variables:

matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space). matY = np.dot(matW, np.transpose(matPhiMPlusOne))
matP (array of shape (n_individuals, n_nodes)) – Data distribution with variance betaInv.
matR (array of shape (n_individuals, n_nodes)) – Responsibilities (posterior probabilities), used to compute data representations: means (matMeans) and modes (matModes). Responsibilities are the main output of GTM. matR[i,:] represents the responsibility vector for an instance i. The columns in matR correspond to rows in matX (nodes).
betaInv (float) – Noise variance parameter for the data distribution. Written as $\beta^{-1}$ in the original paper.
matMeans (array of shape (n_individuals, 2)) – Data representation in 2D space: means (most commonly used for GTM).
matModes (array of shape (n_individuals, 2)) – Data representation in 2D space: modes (for each instance, coordinate with highest responsibility).
n_dimensions (int) – Data space dimensionality (number of variables).
converged (bool) – True if the model has converged; otherwise False.

write(output='output')[source]

Write optimized GTM model: means, modes and responsibilities.

Parameters:

output (str, optional (default = ‘output’)) – Output path.

Returns:

Separate files for (1) means (mean position for each data point), (2) modes (node with max. responsibility for each data point), (3) responsibilities (posterior probabilities for each data point).

Return type:

CSV files

write_all(output='output')[source]

Write optimized GTM model and optimized parameters.

Parameters:

output (str, optional (default = ‘output’)) – Output path.

Returns:

Separate files for (1) means (mean position for each data point), (2) modes (node with max. responsibility for each data point), (3) responsibilities (posterior probabilities for each data point), (4) initial space dimension and data distribution variance, (5) manifold coordinates (matY), (6) parameter matrix (matW).

Return type:

CSV files

class ugtm.ugtm_classes.ReturnU(matU, betaInv)[source]

Bases: object

ugtm.ugtm_core module

Core linear algebra operations for GTM and kGTM

ugtm.ugtm_core.KERNELcreateDistanceMatrix(data, matL, matPhiMPlusOne)[source]

Computes distances between data and manifold for kernel algorithm.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
matL (array of shape (n_individuals, n_rbf_centers+1)) – Parameter matrix (regul).
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.

Returns:

Matrix of distances between manifold and data.

Return type:

array of shape (n_nodes, n_individuals)

ugtm.ugtm_core.computeWidth(matM, numM, sigma)[source]

Initializes radial basis function width using hyperparameter sigma.

Parameters:

matM (array of shape (n_rbf_centers, 2)) – Coordinates of radial basis function (RBF) centers, defining a grid in the 2D space.
numM (int) – Number of RBF centers (n_rbf_centers)
sigma (float) – RBF width factor.

Returns:

Initial radial basis function (RBF) width.

Return type:

float
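The width rule described for rbfWidth (the width factor times the average nearest-neighbour distance between RBF centers, with the maximum distance as the fallback when the factor is 0) can be sketched in plain Python. The helper name rbf_width_sketch is hypothetical and this is only an illustration of the stated rule, not the code of computeWidth().

```python
import math

def rbf_width_sketch(centers, sigma):
    """Illustrative RBF width: sigma * mean nearest-neighbour distance.

    `centers` is a list of (x, y) tuples. Hypothetical helper, not ugtm's
    computeWidth().
    """
    nearest = []
    farthest = 0.0
    for i, a in enumerate(centers):
        dists = [math.dist(a, b) for j, b in enumerate(centers) if j != i]
        nearest.append(min(dists))
        farthest = max(farthest, max(dists))
    if sigma == 0:
        # Documented fallback: use the maximum distance between centers.
        return farthest
    return sigma * sum(nearest) / len(nearest)

# 2x2 grid of RBF centers: every center has its nearest neighbour 2.0 away.
grid = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
width = rbf_width_sketch(grid, sigma=0.3)
```

With this toy grid the width is 0.3 times the grid spacing; passing sigma=0 returns the diagonal of the square instead.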

ugtm.ugtm_core.computelogLikelihood(matP, betaInv, n_dimensions)[source]

Computes log likelihood = GTM objective function

Parameters:

matP (array of shape (n_nodes, n_individuals)) – Data distribution with variance betaInv (transformed: exp(x-max(x)))
betaInv (float) – Noise variance parameter for the data distribution. Written as $\beta^{-1}$ in the original paper.
n_dimensions (int) – Data space dimensionality (number of variables).

Returns:

Log likelihood.

Return type:

float

ugtm.ugtm_core.createDistanceMatrix(matY, data)[source]

Computes distances between manifold centers and data vectors.

Parameters:

matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space);
data (array of shape (n_individuals, n_dimensions)) – Data matrix.

Returns:

Matrix of squared Euclidean distances between manifold and data.

Return type:

array of shape (n_nodes, n_individuals)
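To make the shapes concrete, here is a minimal sketch of a squared Euclidean distance matrix between manifold centers stored column-wise (n_dimensions, n_nodes) and data stored row-wise (n_individuals, n_dimensions). It uses nested lists so it runs without dependencies; the name squared_distances_sketch is hypothetical and the library's own implementation may differ.

```python
def squared_distances_sketch(mat_y, data):
    """Squared Euclidean distances, shape (n_nodes, n_individuals).

    mat_y: list of n_dimensions rows, each with n_nodes entries.
    data:  list of n_individuals rows, each with n_dimensions entries.
    """
    n_dimensions = len(mat_y)
    n_nodes = len(mat_y[0])
    return [
        [
            sum((mat_y[d][i] - row[d]) ** 2 for d in range(n_dimensions))
            for row in data
        ]
        for i in range(n_nodes)
    ]

# Two nodes in a 2-D data space, (0, 0) and (1, 1), stored column-wise.
mat_y = [[0.0, 1.0],
         [0.0, 1.0]]
data = [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
mat_d = squared_distances_sketch(mat_y, data)
```

Row i of the result holds the distances from node i to every data point, which matches the (n_nodes, n_individuals) layout used by the other core functions.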

ugtm.ugtm_core.createGMatrix(matR)[source]

Creates the G diagonal matrix from responsibilities (R)

Parameters:

matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).

Returns:

Diagonal matrix with elements $G_{ii}=\sum_{n}^{n\_individuals} R_{in}$.

Return type:

array of shape (n_nodes, n_nodes)
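The definition $G_{ii}=\sum_{n} R_{in}$ amounts to placing the row sums of R on a diagonal. A small dependency-free sketch, with the hypothetical name g_matrix_sketch:

```python
def g_matrix_sketch(mat_r):
    """Diagonal matrix whose i-th entry is the sum of row i of R."""
    n_nodes = len(mat_r)
    return [
        [sum(mat_r[i]) if i == j else 0.0 for j in range(n_nodes)]
        for i in range(n_nodes)
    ]

# Responsibilities for 2 nodes (rows) and 3 data points (columns).
mat_r = [[0.5, 0.25, 1.0],
         [0.5, 0.75, 0.0]]
mat_g = g_matrix_sketch(mat_r)
```

Each diagonal entry can be read as the effective number of data points assigned to that node.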

ugtm.ugtm_core.createPMatrix(matD, betaInv, n_dimensions)[source]

Computes data distribution matrix = exp(-(parameter)*distances).

Parameters:

matD (array of shape (n_nodes, n_individuals)) – Matrix of squared Euclidean distances between manifold and data.
betaInv (float) – Noise variance parameter for the data distribution. Written as $\beta^{-1}$ in the original paper.
n_dimensions (int) – Data space dimensionality (number of variables).

Returns:

Data distribution with variance betaInv (transformed: exp(x-max(x)))

Return type:

array of shape (n_nodes, n_individuals)

Notes

Important: this data distribution is not exact per se and is to be used as input for createRMatrix (responsibilities).
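The note above can be illustrated with a short sketch. It assumes the exponent is $-d/(2\beta^{-1})$ and that the maximum is subtracted per data point (per column); both are assumptions made for illustration, and the name p_matrix_sketch is hypothetical. The shift only rescales each column by a constant, which is why the result is not the exact density but is still a valid input for computing responsibilities.

```python
import math

def p_matrix_sketch(mat_d, beta_inv):
    """Unnormalized Gaussian weights exp(-d / (2 * beta_inv)), shifted per column.

    mat_d has shape (n_nodes, n_individuals). Subtracting each column's
    largest exponent before exponentiating is an assumed convention.
    """
    n_nodes, n_ind = len(mat_d), len(mat_d[0])
    expo = [[-mat_d[i][n] / (2.0 * beta_inv) for n in range(n_ind)]
            for i in range(n_nodes)]
    col_max = [max(expo[i][n] for i in range(n_nodes)) for n in range(n_ind)]
    return [[math.exp(expo[i][n] - col_max[n]) for n in range(n_ind)]
            for i in range(n_nodes)]

# Distances this large would underflow to 0.0 without the shift.
mat_d = [[4000.0, 2.0],
         [4002.0, 0.0]]
mat_p = p_matrix_sketch(mat_d, beta_inv=1.0)
```

In the first column the unshifted values would both be exp(-2000) and exp(-2001), which evaluate to 0.0 in double precision; after the shift the ratio between the two nodes is preserved.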

ugtm.ugtm_core.createPhiMatrix(matX, matM, numX, numM, sigma)[source]

Creates matrix of RBF functions.

Parameters:

matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
matM (array of shape (n_rbf_centers, 2)) – Coordinates of radial basis function (RBF) centers, defining a grid in the 2D space.
numX (int) – Number of nodes (n_nodes).
numM (int) – Number of RBF centers (n_rbf_centers)
sigma (float) – RBF width factor.

Returns:

RBF matrix plus one dimension to include a term for bias.

Return type:

array of shape (n_nodes, n_rbf_centers+1)

ugtm.ugtm_core.createRMatrix(matP)[source]

Computes responsibilities (posterior probabilities).

Parameters:

matP (array of shape (n_nodes, n_individuals)) – Data distribution with variance betaInv (transformed: exp(x-max(x)))

Returns:

Posterior probabilities (responsibilities).

Return type:

array of shape (n_nodes, n_individuals)
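In the standard GTM formulation, responsibilities are obtained by normalizing the data distribution over nodes, so that each data point's column sums to 1. The sketch below assumes that convention; responsibilities_sketch is a hypothetical name, not the library function.

```python
def responsibilities_sketch(mat_p):
    """Normalize each column of P (one column per data point) to sum to 1."""
    n_nodes, n_ind = len(mat_p), len(mat_p[0])
    col_sums = [sum(mat_p[i][n] for i in range(n_nodes)) for n in range(n_ind)]
    return [[mat_p[i][n] / col_sums[n] for n in range(n_ind)]
            for i in range(n_nodes)]

# Unnormalized weights for 2 nodes (rows) and 2 data points (columns).
mat_p = [[1.0, 0.5],
         [1.0, 1.5]]
mat_r = responsibilities_sketch(mat_p)
```

Because only ratios within a column matter, any constant factor applied to a column of P cancels out here.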

ugtm.ugtm_core.createWMatrix(matX, matPhiMPlusOne, matU, n_dimensions, n_rbf_centers)[source]

Creates PCA-initialized parameter matrix W.

Parameters:

matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
matU (array of shape (n_dimensions, 2)) – 2 first principal axes of data covariance matrix.
n_dimensions (int) – Data space dimensionality (number of variables).
n_rbf_centers (int) – Number of RBF centers.
sigma (float) – RBF width factor.

Returns:

Parameter matrix W (PCA-initialized).

Return type:

array of shape (n_dimensions, n_rbf_centers+1)

ugtm.ugtm_core.createYMatrix(matW, matPhiMPlusOne)[source]

Updates manifold matrix (Y) using new parameter matrix (W).

Parameters:

matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.

Returns:

Manifold in n-dimensional space (projection of matX in data space); A point matY[:,i] is a center of Gaussian component in data space. $\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T$

Return type:

array of shape (n_dimensions, n_nodes)
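The mapping $\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T$ is a plain matrix product. A minimal sketch with nested lists, where y_matrix_sketch is a hypothetical name and the last column of the RBF matrix plays the role of the bias term:

```python
def y_matrix_sketch(mat_w, mat_phi):
    """Y = W @ Phi^T.

    mat_w:   (n_dimensions, n_rbf_centers + 1)
    mat_phi: (n_nodes, n_rbf_centers + 1)
    returns: (n_dimensions, n_nodes)
    """
    return [
        [sum(w * p for w, p in zip(w_row, phi_row)) for phi_row in mat_phi]
        for w_row in mat_w
    ]

mat_w = [[1.0, 0.0, 2.0],
         [0.0, 1.0, -1.0]]
mat_phi = [[1.0, 0.0, 1.0],   # node 0, bias column last
           [0.0, 1.0, 1.0],   # node 1
           [0.5, 0.5, 1.0]]   # node 2
mat_y = y_matrix_sketch(mat_w, mat_phi)
```

Column i of the result is the data-space position of node i, matching the matY[:,i] convention.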

ugtm.ugtm_core.createYMatrixInit(data, matW, matPhiMPlusOne)[source]

Creates initial manifold matrix (Y).

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
matW (array of shape (n_dimensions, n_rbf_centers+1)) – Parameter matrix (PCA-initialized).
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.

Returns:

Manifold in n-dimensional space (projection of matX in data space); A point matY[:,i] is a center of Gaussian component in data space. $\mathbf{Y}=\mathbf{W}\mathbf{\Phi}^T$

Return type:

array of shape (n_dimensions, n_nodes)

ugtm.ugtm_core.evalBetaInv(matY, betaInv, random_state=1234)[source]

Decides which value to use for initial noise variance parameter.

Parameters:

matY (array of shape (n_dimensions, n_nodes)) – Manifold in n-dimensional space (projection of matX in data space);
betaInv (float) – The 3rd eigenvalue of the data covariance matrix.
random_state (int, optional) – Random state used to initialize BetaInv randomly in case of bad initialization.

Returns:

Noise variance parameter for the data distribution (betaInv). Written as $\beta^{-1}$ in the original paper. Initialized to the larger of: (1) the 3rd eigenvalue of the data covariance matrix (function parameter), and (2) half the average distance between centers of Gaussian components. In case of bad initialization (betaInv = 0), betaInv is set to a random value and a message is displayed on screen.

Return type:

float

ugtm.ugtm_core.exp_normalize(x)[source]

Exp-normalize trick: compute exp(x-max(x))

Parameters:

x (2D array) – An array x.

Returns:

y = exp(x-max(x))

Return type:

2D array

ugtm.ugtm_core.initBetaInvRandom(matD, n_nodes, n_individuals, n_dimensions)[source]

Computes initial noise variance parameter for kernel GTM.

Parameters:

matD (array of shape (n_nodes, n_individuals)) – Matrix of squared Euclidean distances between manifold and data.
n_nodes (int) – The number of nodes defining a grid in the 2D space.
n_individuals (int) – The number of data instances.
n_dimensions (int) – Data space dimensionality (number of variables).

Returns:

Noise variance parameter for the data distribution (betaInv). Written as $\beta^{-1}$ in the original paper.

Return type:

float

ugtm.ugtm_core.meanPoint(matR, matX)[source]

Computes mean positions for data points (usual GTM output).

Parameters:

matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.

Returns:

Data representation in 2D space: mean positions (usual GTM output).

Return type:

array of shape (n_individuals, 2)

ugtm.ugtm_core.modePoint(matR, matX)[source]

Computes modes (nodes with maximum responsibility for each data point).

Parameters:

matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
matX (array of shape (n_nodes, 2)) – Coordinates of nodes defining a grid in the 2D space.

Returns:

Data representation in 2D space: modes (nodes with max responsibility).

Return type:

array of shape (n_individuals, 2)
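The two 2D representations, means from meanPoint() and modes from modePoint(), differ only in how the responsibilities are used: a weighted average of node coordinates versus the coordinates of the single most responsible node. A small sketch, with hypothetical function names and nested lists in place of arrays:

```python
def mean_point_sketch(mat_r, mat_x):
    """Responsibility-weighted average node position for each data point."""
    n_nodes, n_ind = len(mat_r), len(mat_r[0])
    return [
        [sum(mat_r[i][n] * mat_x[i][c] for i in range(n_nodes)) for c in range(2)]
        for n in range(n_ind)
    ]

def mode_point_sketch(mat_r, mat_x):
    """Position of the node with the highest responsibility for each data point."""
    n_nodes, n_ind = len(mat_r), len(mat_r[0])
    modes = []
    for n in range(n_ind):
        best = max(range(n_nodes), key=lambda i: mat_r[i][n])
        modes.append(list(mat_x[best]))
    return modes

mat_x = [[-1.0, -1.0], [1.0, 1.0]]   # two nodes in the 2D latent space
mat_r = [[0.75, 0.25],               # rows: nodes, columns: data points
         [0.25, 0.75]]
means = mean_point_sketch(mat_r, mat_x)
modes = mode_point_sketch(mat_r, mat_x)
```

Means can fall anywhere between nodes, whereas modes always coincide with a grid node.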

ugtm.ugtm_core.optimBetaInv(matR, matD, n_dimensions)[source]

Updates noise variance parameter.

Parameters:

matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
matD (array of shape (n_nodes, n_individuals)) – Matrix of squared Euclidean distances between manifold and data.
n_dimensions (int) – Data space dimensionality (number of variables).

Returns:

Updated noise variance parameter ($\beta^{-1}$).

Return type:

float

ugtm.ugtm_core.optimLMatrix(matR, matPhiMPlusOne, matG, betaInv, regul)[source]

Updates parameter matrix L (matL) for kernel GTM.

Parameters:

matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
matG (array of shape (n_nodes, n_nodes)) – Diagonal matrix with elements $G_{ii}=\sum_{n}^{n\_individuals} R_{in}$.
betaInv (float) – Noise variance parameter for the data distribution. Written as $\beta^{-1}$ in the original paper.
regul (float) – Regularization coefficient.

Returns:

Updated parameter matrix L (matL).

Return type:

array of shape (n_individuals, n_rbf_centers+1)

ugtm.ugtm_core.optimWMatrix(matR, matPhiMPlusOne, matG, data, betaInv, regul)[source]

Updates parameter matrix W.

Parameters:

matR (array of shape (n_nodes, n_individuals)) – Posterior probabilities (responsibilities).
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus one dimension to include a term for bias.
matG (array of shape (n_nodes, n_nodes)) – Diagonal matrix with elements $G_{ii}=\sum_{n}^{n\_individuals} R_{in}$.
data (array of shape (n_individuals, n_dimensions)) – Data matrix.
betaInv (float) – Noise variance parameter for the data distribution. Written as $\beta^{-1}$ in the original paper.
regul (float) – Regularization coefficient.

Returns:

Updated parameter matrix W.

Return type:

array of shape (n_dimensions, n_rbf_centers+1)

ugtm.ugtm_core.optimWMatrixAcc(g_vec, RT_acc, matPhiMPlusOne, betaInv, regul)[source]

Updates W from block-accumulated sufficient statistics (iGTM M-step).

Parameters:

g_vec (array of shape (n_nodes,)) – Accumulated row sums of R across all blocks: $\sum_b \mathbf{R}_b \mathbf{1}$.
RT_acc (array of shape (n_nodes, n_dimensions)) – Accumulated R @ X across all blocks: $\sum_b \mathbf{R}_b \mathbf{X}_b$.
matPhiMPlusOne (array of shape (n_nodes, n_rbf_centers+1)) – RBF matrix plus bias column.
betaInv (float) – Noise variance parameter.
regul (float) – Regularization coefficient.

Returns:

Updated parameter matrix W.

Return type:

array of shape (n_dimensions, n_rbf_centers+1)

Notes

Equivalent to optimWMatrix() but avoids forming the full N×N responsibility matrix by using pre-accumulated statistics. $\mathbf{\Phi}^T \mathbf{G} \mathbf{\Phi}$ is computed as $(\mathbf{\Phi} \odot \mathbf{g})^T \mathbf{\Phi}$ to avoid an explicit n_nodes×n_nodes diagonal matrix.
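The identity used in the note, $\mathbf{\Phi}^T \mathbf{G} \mathbf{\Phi} = (\mathbf{\Phi} \odot \mathbf{g})^T \mathbf{\Phi}$ for diagonal $\mathbf{G}$, can be checked numerically. The sketch below compares both routes on a tiny example; the function names are hypothetical and nested lists stand in for arrays.

```python
def phi_t_g_phi_dense(phi, g_vec):
    """Phi^T G Phi with an explicit n_nodes x n_nodes diagonal matrix G."""
    n_nodes, n_cols = len(phi), len(phi[0])
    g = [[g_vec[i] if i == j else 0.0 for j in range(n_nodes)]
         for i in range(n_nodes)]
    g_phi = [[sum(g[i][k] * phi[k][c] for k in range(n_nodes))
              for c in range(n_cols)] for i in range(n_nodes)]
    return [[sum(phi[i][a] * g_phi[i][b] for i in range(n_nodes))
             for b in range(n_cols)] for a in range(n_cols)]

def phi_t_g_phi_scaled(phi, g_vec):
    """Same product, scaling each row of Phi by g_i instead of building G."""
    n_cols = len(phi[0])
    scaled = [[g * v for v in row] for g, row in zip(g_vec, phi)]
    return [[sum(s[a] * p[b] for s, p in zip(scaled, phi))
             for b in range(n_cols)] for a in range(n_cols)]

phi = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # (n_nodes, n_rbf_centers + 1)
g_vec = [2.0, 0.5, 1.0]                      # row sums of R
dense = phi_t_g_phi_dense(phi, g_vec)
scaled = phi_t_g_phi_scaled(phi, g_vec)
```

The row-scaling route needs memory proportional to the size of the RBF matrix only, while the dense route allocates n_nodes squared entries that are almost all zero.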

ugtm.ugtm_crossvalidate module

Cross-validation support for GTC and GTR models (also SVM and PCA).

ugtm.ugtm_crossvalidate.crossvalidateGTC(data, labels, k=16, m=4, s=-1.0, regul=1.0, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, predict_mode='bayes', prior='estimated', n_folds=5, n_repetitions=10)[source]

Cross-validate GTC model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = -1)) – RBF width factor. Default (-1) is to try different values. Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = -1)) – Regularization coefficient. Default (-1) is to try different values. Impacts manifold flexibility.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes). At the moment, n_neighbors for GTC is always equal to 1.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
predict_mode ({“bayes”, “knn”}, optional) – Choose between nearest node algorithm (“knn”, output of predictNN()) or GTM Bayes classifier (“bayes”, output of predictBayes()). NB: the kNN algorithm is limited to only 1 nearest node at the moment (n_neighbors = 1).
prior ({“estimated”, “equiprobable”}, optional) – Type of prior used to build GTM class map (classMap()). Choose “estimated” to account for class imbalance.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.
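The automatic sizing rules mentioned for k = 0 and m = 0 can be worked through with a few lines of arithmetic. The sketch below truncates to integers with int(); that rounding choice is an assumption, since the reference does not state how the values are rounded, and default_grid_sizes is a hypothetical name.

```python
import math

def default_grid_sizes(n_individuals):
    """k = sqrt(5 * sqrt(n)) + 2 and m = sqrt(k), truncated to integers.

    Truncation with int() is an assumption; the rounding rule is not
    stated in this reference.
    """
    k = int(math.sqrt(5 * math.sqrt(n_individuals)) + 2)
    m = int(math.sqrt(k))
    return k, m

k, m = default_grid_sizes(1000)
```

For 1000 instances this gives a 14x14 node grid and a 3x3 RBF grid under the truncation assumption.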

ugtm.ugtm_crossvalidate.crossvalidateGTR(data, labels, k=16, m=4, s=-1, regul=-1, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)[source]

Cross-validate GTR model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = -1)) – RBF width factor. Default (-1) is to try different values. Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = -1)) – Regularization coefficient. Default (-1) is to try different values. Impacts manifold flexibility.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.

ugtm.ugtm_crossvalidate.crossvalidatePCAC(data, labels, n_neighbors=1, maxneighbours=11, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)[source]

Cross-validate PCA kNN classification model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
maxneighbours (int, optional (default = 11)) – The function crossvalidates kNN models with k between n_neighbors and maxneighbours.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.

ugtm.ugtm_crossvalidate.crossvalidatePCAR(data, labels, n_neighbors=1, maxneighbours=11, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)[source]

Cross-validate PCA kNN regression model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
maxneighbours (int, optional (default = 11)) – The function crossvalidates kNN models with k between n_neighbors and maxneighbours.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.

ugtm.ugtm_crossvalidate.crossvalidateSVC(data, labels, C=1.0, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)[source]

Cross-validate SVC model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
C (float, optional (default = 1.0)) – SVM regularization parameter.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.

ugtm.ugtm_crossvalidate.crossvalidateSVCrbf(data, labels, C=1, gamma=1, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)[source]

Cross-validate SVC model with RBF kernel.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
C (float, optional (default = 1)) – SVM regularization parameter.
gamma (float, optional (default = 1)) – RBF parameter.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.

ugtm.ugtm_crossvalidate.crossvalidateSVR(data, labels, C=-1, epsilon=-1, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, n_folds=5, n_repetitions=10)[source]

Cross-validate SVR model with linear kernel.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Train set data matrix.
labels (array of shape (n_individuals, 1)) – Labels for train set.
C (float, optional (default = -1)) – SVM regularization parameter. If (C = -1), different values are tested.
epsilon (float, optional (default = -1)) – SVM tolerance parameter. If (epsilon = -1), different values are tested.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
n_folds (int, optional (default = 5)) – Number of CV folds.
n_repetitions (int, optional (default = 10)) – Number of CV iterations.

ugtm.ugtm_crossvalidate.whichExperiment(data, labels, args, discrete=False)[source]

ugtm.ugtm_gtm module

Functions to run GTM models.

ugtm.ugtm_gtm.initialize(data, k, m, s, random_state=1234)[source]

Initializes a GTM model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
k (int) – Sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int) – Sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
random_state (int, optional (default = 1234)) – Random state.

Returns:

Initial GTM model (not optimized).

Return type:

instance of InitialGTM

Notes

We use approximately the same notation as in the original GTM paper by C. Bishop et al. The initialization process is the following:

GTM grid parameters: number of nodes = k*k, number of rbf centers = m*m

Create node matrix X (matX, meshgrid of shape (k*k,2))

Create rbf centers matrix M (matM, meshgrid of shape (m*m, 2))

Initialize rbf width (rbfWidth, computeWidth())

Create rbf matrix $\Phi$ (matPhiMPlusOne, createPhiMatrix())

Perform PCA on the data using sklearn’s PCA function

Set U matrix to 2 first principal axes of data cov. matrix (matU)

Initialize parameter matrix W using U and $\Phi$ (matW, createWMatrix())

Initialize manifold Y using W and $\Phi$ (matY, createYMatrixInit())

Set noise variance parameter (betaInv, evalBetaInv()) to the larger of: (1) the 3rd eigenvalue of the data covariance matrix, and (2) half the average distance between centers of Gaussian components.

Store initial GTM model in InitialGTM object
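The first steps of the list above build two regular grids in the 2D latent space, one with k*k nodes and one with m*m RBF centers. A minimal sketch follows; the [-1, 1] range, the point ordering, and the name square_grid_sketch are assumptions made for illustration and may not match how the library lays out its meshgrid.

```python
def square_grid_sketch(side):
    """Return side*side (x, y) points evenly spaced on [-1, 1] x [-1, 1]."""
    if side == 1:
        return [(0.0, 0.0)]
    step = 2.0 / (side - 1)
    axis = [-1.0 + i * step for i in range(side)]
    return [(x, y) for x in axis for y in axis]

k, m = 3, 2
mat_x = square_grid_sketch(k)   # node grid, k*k rows of 2 coordinates
mat_m = square_grid_sketch(m)   # RBF center grid, m*m rows of 2 coordinates
```

Both grids cover the same square, so every RBF center sits inside the region spanned by the nodes.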

ugtm.ugtm_gtm.optimize(data, initialModel, regul, niter, verbose=True)[source]

Optimizes a GTM model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
initialModel (instance of InitialGTM) – PCA-initialized GTM model. The initial model is kept separate from the optimized model so that different data sets can potentially be used for initialization and optimization.
regul (float) – Regularization coefficient.
niter (int) – Number of iterations for EM algorithm.
verbose (bool, optional (default = True)) – Verbose mode (outputs loglikelihood values during EM algorithm).

Returns:

Optimized GTM model.

Return type:

instance of OptimizedGTM

Notes

We use approximately the same notation as in the original GTM paper by C. Bishop et al. The GTM optimization process is the following:

Create distance matrix D between manifold and data matrix (matD, createDistanceMatrix())

Until convergence ($\Delta(log likelihood) \leq 0.0001$):

Update data distribution P (matP, createPMatrix())

Update responsibilities R (matR, createRMatrix())

Update diagonal matrix G (matG, createGMatrix())

Update parameter matrix W (matW, optimWMatrix())

Update manifold matrix Y (matY, createYMatrix())

Update distance matrix D (matD, createDistanceMatrix())

Update noise variance parameter $\beta^{-1}$ (betaInv, optimBetaInv())

Estimate log likelihood and check for convergence (computelogLikelihood())

Compute 2D GTM representation 1: means (matMeans, meanPoint())

Compute 2D GTM representation 2: modes (matModes, modePoint())

Store GTM model in OptimizedGTM object
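The control flow of the loop above (repeat the update steps until the log likelihood changes by at most 0.0001, or until niter passes have run) can be sketched independently of the linear algebra. Here em_step is a hypothetical callable standing in for one full pass of the update steps, and comparing the absolute change is an assumption about how the threshold is applied.

```python
def run_until_converged(em_step, niter, tol=1e-4):
    """Call em_step() until the log-likelihood change is <= tol or niter is hit.

    em_step is a hypothetical callable that performs one pass of the
    update steps and returns the current log likelihood.
    Returns (iterations_run, last_log_likelihood, converged).
    """
    previous = None
    for iteration in range(1, niter + 1):
        current = em_step()
        if previous is not None and abs(current - previous) <= tol:
            return iteration, current, True
        previous = current
    return niter, previous, False

# Toy stand-in for the update steps: a log likelihood that levels off.
values = iter([-120.0, -80.0, -75.0, -74.99995, -74.99994])
iterations, loglik, converged = run_until_converged(lambda: next(values), niter=200)
```

The returned flag plays the same role as the converged attribute of OptimizedGTM: it distinguishes a run that met the threshold from one that simply exhausted niter.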

ugtm.ugtm_gtm.projection(optimizedModel, new_data)[source]

Project test set on optimized GTM model. No pre-processing involved.

Parameters:

optimizedModel (instance of OptimizedGTM) – Optimized GTM model, built using training set (train).
new_data (array of shape (n_test, n_dimensions)) – Test data matrix.

Returns:

Returns an instance of OptimizedGTM corresponding to the projected test set.

Return type:

instance of OptimizedGTM

Notes

The new_data must have been through exactly the same preprocessing as the data used to obtain the optimized GTM model. To get a function doing the preprocessing as well as projection on the map, cf. transform().
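As an illustration of "exactly the same preprocessing", the sketch below standardizes a test set with statistics computed on the training set only. Standardization is just an example of a preprocessing step, and fit_scaler and apply_scaler are hypothetical helpers, not part of ugtm.

```python
import statistics

def fit_scaler(train):
    """Column means and population standard deviations of the training set."""
    columns = list(zip(*train))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is 0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_scaler(rows, means, stds):
    """Apply previously fitted statistics to any data set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train = [[0.0, 10.0], [2.0, 30.0]]
test = [[1.0, 40.0]]

means, stds = fit_scaler(train)                   # fitted on train only
train_scaled = apply_scaler(train, means, stds)   # data used to build the model
test_scaled = apply_scaler(test, means, stds)     # data passed as new_data
```

Refitting the statistics on the test set would place the two sets in different coordinate systems, so the projected positions would no longer be comparable with the training map.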

ugtm.ugtm_gtm.runGTM(data, k=16, m=4, s=0.3, regul=0.1, doPCA=False, n_components=-1, missing=True, missing_strategy='median', random_state=1234, niter=200, verbose=False)[source]

Run GTM (wrapper for initialize + optimize).

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = True)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).

Returns:

Optimized GTM model.

Return type:

instance of OptimizedGTM
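
The rules for k = 0 and m = 0 described above can be sketched in a few lines. This is an illustration only: the helper name default_grid_sizes is hypothetical, and truncating with int() is an assumption, since the documentation does not state how non-integer results are rounded.

```python
import math


def default_grid_sizes(n_individuals, k=0, m=0):
    """Apply the documented fallbacks for k = 0 and m = 0.

    Hypothetical helper, not part of ugtm. Truncation with int() is an
    assumption about the rounding rule.
    """
    if k == 0:
        k = int(math.sqrt(5 * math.sqrt(n_individuals))) + 2
    if m == 0:
        m = int(math.sqrt(k))
    return k, m


k, m = default_grid_sizes(400)
n_nodes = k * k          # the map is a k x k grid of nodes
n_rbf_centers = m * m    # the RBF centers form an m x m grid
```

With 400 individuals this sketch gives a 12x12 node grid and a 3x3 grid of RBF centers; explicit non-zero values of k and m are returned unchanged.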

Notes

We use approximately the same notations as in the original GTM paper by C. Bishop et al. The initialization and optimization process is the following:

GTM initialization:

GTM grid parameters: number of nodes = k*k, number of rbf centers = m*m

Create node matrix X (matX, meshgrid of shape (k*k,2))

Create rbf centers matrix M (matM, meshgrid of shape (m*m, 2))

Initialize rbf width (rbfWidth, computeWidth())

Create rbf matrix $\Phi$ (matPhiMPlusOne, createPhiMatrix())

Perform PCA on the data using sklearn’s PCA function

Set U matrix to 2 first principal axes of data cov. matrix (matU)

Initialize parameter matrix W using U and $\Phi$ (matW, createWMatrix())

Initialize manifold Y using W and $\Phi$ (matY, createYMatrixInit())

Set noise variance parameter (betaInv, evalBetaInv()) to the larger of: (1) the 3rd eigenvalue of the data covariance matrix, (2) half the average distance between centers of Gaussian components.

GTM optimization:

Create distance matrix D between manifold and data matrix (matD, createDistanceMatrix())

Until convergence ($\Delta(log likelihood) \leq 0.0001$):

Update data distribution P (matP, createPMatrix())

Update responsibilities R (matR, createRMatrix())

Update diagonal matrix G (matG, createGMatrix())

Update parameter matrix W (matW, optimWMatrix())

Update manifold matrix Y (matY, createYMatrix())

Update distance matrix D (matD, createDistanceMatrix())

Update noise variance parameter $\beta^{-1}$ (betaInv, optimBetaInv())

Estimate log likelihood and check for convergence (computelogLikelihood())

Compute 2D GTM representation 1: means (matMeans, meanPoint())

Compute 2D GTM representation 2: modes (matModes, modePoint())

Store GTM model in OptimizedGTM object
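
The first steps of one EM iteration (distance matrix D, data distribution P, responsibilities R) can be sketched with NumPy. This is a minimal illustration of the standard GTM E-step, not the code of createDistanceMatrix(), createPMatrix() or createRMatrix(); the toy data, the (n_nodes, n_individuals) orientation and the constant factors dropped from P are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_individuals, n_dimensions = 9, 5, 3

# Toy inputs (assumed values, for illustration only).
matY = rng.normal(size=(n_dimensions, n_nodes))         # manifold, one column per node
data = rng.normal(size=(n_individuals, n_dimensions))   # data matrix
betaInv = 0.5                                           # noise variance

# D: squared Euclidean distance between each node center and each individual.
diff = matY.T[:, None, :] - data[None, :, :]
matD = (diff ** 2).sum(axis=2)                          # shape (n_nodes, n_individuals)

# P: unnormalized Gaussian densities. Subtracting the column minimum avoids
# underflow and cancels out in the normalization below.
matP = np.exp(-(matD - matD.min(axis=0)) / (2.0 * betaInv))

# R: responsibilities, i.e. the posterior probability of each node
# for each individual (each column sums to 1).
matR = matP / matP.sum(axis=0)
```

Each column of matR is a probability distribution over the nodes; the means and modes representations listed above are summaries of that distribution.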

ugtm.ugtm_gtm.transform(optimizedModel, train, test, doPCA=False, n_components=-1, missing=True, missing_strategy='median', random_state=1234, process=True)[source]

Preprocess and project test set on optimized GTM model.

Parameters:

optimizedModel (instance of OptimizedGTM) – Optimized GTM model, built using training set (train).
train (array of shape (n_train, n_dimensions)) – Training data matrix.
test (array of shape (n_test, n_dimensions)) – Test data matrix.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = True)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int (default = 1234)) – Random state.
process (bool (default = True)) – Apply preprocessing (missing, PCA) to train set and use values from train set to preprocess test set.

Returns:

Returns an instance of OptimizedGTM corresponding to the projected test set.

Return type:

instance of OptimizedGTM
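
The idea behind process=True, where values estimated on the train set are reused to preprocess the test set, can be sketched for median imputation. This is a NumPy illustration under assumed toy data, not the scikit-learn calls that ugtm makes internally.

```python
import numpy as np

# Toy train and test sets with missing values (assumed data).
train = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [5.0, 8.0]])
test = np.array([[np.nan, 2.0]])

# Column medians are estimated on the train set only...
medians = np.nanmedian(train, axis=0)

# ...and reused to fill missing values in both train and test.
train_filled = np.where(np.isnan(train), medians, train)
test_filled = np.where(np.isnan(test), medians, test)
```

The test set never contributes to the imputation values, which keeps its preprocessing identical to that of the data used to build the model.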

ugtm.ugtm_kgtm module

Functions to initialize and optimize kernel GTM models.

ugtm.ugtm_kgtm.initializeKernel(data, k, m, s, maxdim, random_state=1234)[source]

Initializes a kernel GTM (kGTM) model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
k (int) – Sqrt of the number of GTM nodes. Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int) – Sqrt of the number of RBF centers. Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float) – RBF width factor. Parameter to tune width of RBF functions. Impacts manifold flexibility.
maxdim (int) – Max boundary for internal dimensionality estimation.
random_state (int, optional (default = 1234)) – Random state.

Returns:

Initial GTM model (not optimized).

Return type:

instance of InitialGTM

ugtm.ugtm_kgtm.optimizeKernel(data, initialModel, regul, niter, verbose=True)[source]

Optimizes a kGTM model.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
initialModel (instance of InitialGTM) – Initial kGTM model. The initial model is separate from the optimized model so that different data sets can be potentially used for initialization and optimization.
regul (float) – Regularization coefficient.
niter (int) – Number of iterations for EM algorithm.
verbose (bool, optional (default = True)) – Verbose mode (outputs loglikelihood values during EM algorithm).

Returns:

Optimized kGTM model.

Return type:

instance of OptimizedGTM

ugtm.ugtm_kgtm.runkGTM(data, k=16, m=4, s=0.3, regul=0.1, maxdim=100, doPCA=False, doKernel=False, kernel='linear', n_components=-1, missing=True, missing_strategy='median', random_state=1234, niter=200, verbose=False)[source]

Run kGTM algorithm (wrapper for initialize + optimize).

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = 0.3)) – RBF width factor. Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
maxdim (int, optional (default = 100)) – Max boundary for internal dimensionality estimation. Internal dimensionality is estimated as the number of principal components accounting for 99.5% of data variance. If this value is higher than maxdim, it is replaced by maxdim.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
doKernel (bool, optional (default = False)) – If doKernel is False, the data is supposed to be a kernel already. If doKernel is True, a kernel will be computed from the data.
kernel (scikit-learn kernel (default = “linear”))
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = True)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).

Returns:

Optimized kGTM model.

Return type:

instance of OptimizedGTM

ugtm.ugtm_landscape module

Builds continuous GTM class maps or landscapes using labels or activities.

class ugtm.ugtm_landscape.ClassMap(nodeClassP, nodeClassT, activityModel, uniqClasses)[source]

Bases: object

Class for ClassMap: Bayesian classification model for each GTM node.

Parameters:

nodeClassT (array of shape (n_nodes, n_classes)) – Likelihood of each node $k$ given class $C_i$: $P(k|C_i) = \frac{\sum_{i_{c}}R_{i_{c},k}}{N_c}$.
nodeClassP (array of shape (n_nodes, n_classes)) – Posterior probabilities of each class $C_i$ for each node $k$: $P(C_i|k) =\frac{P(k|C_i)P(C_i)}{\sum_i P(k|C_i)P(C_i)}$
activityModel (array of shape (n_nodes,1)) – Class label attributed to each GTM node on the GTM node grid. Computed using argmax of posterior probabilities.
uniqClasses (array of shape (n_classes,1)) – Unique class labels.

ugtm.ugtm_landscape.classMap(optimizedModel, activity, prior='estimated')[source]

Computes GTM class map based on discrete activities (= discrete labels)

Parameters:

optimizedModel (an instance of OptimizedGTM) – The optimized GTM model.
activity (array of shape (n_individuals,1)) – Activity vector (discrete labels) associated with the data used to compute the optimized GTM model.
prior ({estimated, equiprobable}, optional) – Type of prior used for Bayesian classifier. “equiprobable” assigns the same weight to all classes: $P(C_i)=1/N_{classes}$. “estimated” accounts for class imbalance using the number of individuals in each class $N(C_i)$: $P(C_i)=N_{C_i}/N_{total}$

Returns:

Computes a GTM Bayesian model and returns an instance of ClassMap.

Return type:

instance of ClassMap

Notes

This function computes the likelihood of each GTM node given a class, the posterior probabilities of each class (using Bayes’ theorem), and the class attributed to each node:

output.nodeClassT: likelihood of each node $k$ given class $C_i$: $P(k|C_i) = \frac{\sum_{i_{c}}R_{i_{c},k}}{N_c}$.

output.nodeClassP: posterior probabilities of each class $C_i$ for each node $k$, using priors $P(C_i)$: $P(C_i|k) =\frac{P(k|C_i)P(C_i)}{\sum_i P(k|C_i)P(C_i)}$

output.activityModel:
Class label attributed to each GTM node on the GTM node grid. Computed using argmax of posterior probabilities.
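
The three quantities above can be computed directly from a responsibility matrix. The sketch below is a NumPy illustration of the formulas, not the body of classMap(); the toy responsibilities, their (n_individuals, n_nodes) orientation and the labels are assumptions.

```python
import numpy as np

# Toy responsibilities, shape (n_individuals, n_nodes); each row sums to 1.
matR = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.2, 0.7],
                 [0.0, 0.5, 0.5]])
activity = np.array(["a", "a", "b", "b"])   # discrete labels

uniqClasses, counts = np.unique(activity, return_counts=True)

# Likelihood P(k|C_i): summed responsibilities of the class members,
# divided by the class size N_c. Shape (n_nodes, n_classes).
nodeClassT = np.column_stack(
    [matR[activity == c].sum(axis=0) / n for c, n in zip(uniqClasses, counts)]
)

# "estimated" prior P(C_i) = N_{C_i} / N_total; an "equiprobable" prior
# would be np.full(len(uniqClasses), 1 / len(uniqClasses)).
prior = counts / counts.sum()

# Posterior P(C_i|k) by Bayes' theorem, normalized over classes for each node.
joint = nodeClassT * prior
nodeClassP = joint / joint.sum(axis=1, keepdims=True)

# Class label of each node: argmax of the posterior probabilities.
activityModel = uniqClasses[np.argmax(nodeClassP, axis=1)]
```

In this toy case the first node is assigned to class "a" and the other two to class "b". A node with zero responsibility from every class would need special handling, which the sketch omits.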

ugtm.ugtm_landscape.landscape(optimizedModel, activity)[source]

Computes GTM landscapes based on activities (= continuous labels).

Parameters:

optimizedModel (an instance of OptimizedGTM) – The optimized GTM model.
activity (array of shape (n_individuals,1)) – Activity vector (continuous labels) associated with the data used to compute the optimized GTM model.

Returns:

Activity landscape: associates each GTM node $k$ on the GTM node grid with an activity value, computed as a responsibility-weighted average of data activity values (continuous labels). If a = activities, r_k = vector of optimized GTM responsibilities for node k, and N = n_individuals: $landscape_k = \frac{\mathbf{a \cdot r}_k}{\sum_i^{N}r_{ik}}$

Return type:

array of shape (n_nodes,1)
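
The landscape formula can be written as one matrix product. The sketch below is a NumPy illustration of the formula rather than the body of landscape(); the toy responsibilities, their (n_individuals, n_nodes) orientation and the activity values are assumptions.

```python
import numpy as np

# Toy responsibilities, shape (n_individuals, n_nodes); each row sums to 1.
matR = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.2, 0.7],
                 [0.0, 0.5, 0.5]])
activity = np.array([1.0, 2.0, 5.0, 6.0])   # continuous labels

# landscape_k = (a . r_k) / sum_i r_ik, computed for all nodes at once.
landscape_values = (activity @ matR) / matR.sum(axis=0)
landscape_values = landscape_values.reshape(-1, 1)   # shape (n_nodes, 1)
```

Each node value is a weighted mean, so it always lies between the smallest and largest activity in the data.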

ugtm.ugtm_predictions module

GTC (GTM classification) and GTR (GTM regression)

ugtm.ugtm_predictions.GTC(train, labels, test, k=16, m=4, s=0.3, regul=0.1, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, predict_mode='bayes', prior='estimated')[source]

Run GTC (GTM classification): Bayes or nearest node algorithm.

Parameters:

train (array of shape (n_train, n_dimensions)) – Train set data matrix.
labels (array of shape (n_train, 1)) – Labels for train set.
test (array of shape (n_test, n_dimensions)) – Test set data matrix.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient. Impacts manifold flexibility.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes). At the moment, n_neighbors is always equal to 1.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
predict_mode ({“bayes”, “knn”}, optional) – Choose between nearest node algorithm (“knn”, output of predictNN()) or GTM Bayes classifier (“bayes”, output of predictBayes()). NB: the kNN algorithm is limited to only 1 nearest node at the moment (n_neighbors = 1).
prior ({“estimated”, “equiprobable”}, optional) – Type of prior used to build GTM class map (classMap()). Choose “estimated” to account for class imbalance.

Returns:

Predicted class for test set individuals.

Return type:

array of shape (n_test, 1)

Notes

The GTM nearest node classifier (predict_mode = “knn”, predictNN()):

A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has a class label

The test set is projected onto the GTM map

A 2D GTM representation is chosen for the test set (representation = modes or means)

Nearest node on the GTM map is found for each test set individual

The predicted label for each individual is the label of its nearest node on the GTM map

The GTM Bayes classifier (predict_mode = “bayes”, predictBayes()):

A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has posterior class probabilities

The test set is projected onto the GTM map

The GTM representation for each individual is its responsibility vector (posterior probability distribution on the map)

The probabilities of belonging to each class for a specific individual are computed as an average of posterior class probabilities (array of shape (n_nodes, n_classes)), weighted by the individual’s responsibilities on the GTM map (array of shape (1, n_nodes))

ugtm.ugtm_predictions.GTR(train, labels, test, k=16, m=4, s=0.3, regul=0.1, n_neighbors=1, niter=200, representation='modes', doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234)[source]

Run GTR (GTM nearest node(s) regression).

Parameters:

train (array of shape (n_train, n_dimensions)) – Train set data matrix.
labels (array of shape (n_train, 1)) – Labels for train set.
test (array of shape (n_test, n_dimensions)) – Test set data matrix.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient. Impacts manifold flexibility.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes).
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.

Returns:

Predicted activity values for test set individuals.

Return type:

array of shape (n_test, 1)

Notes

The GTM nearest node(s) regression (predictNN()):

A GTM landscape (GTM colored by activity) is built using the training set (landscape()); the landscape is discretized into nodes, and each node has an estimated activity value

The test set is projected onto the GTM map

A 2D GTM representation is chosen for the test set (representation = modes or means)

The nearest node(s) on the GTM map are found for each test set individual

The predicted activity for each individual is a weighted average of nearest node activities.

ugtm.ugtm_predictions.advancedGTC(train, labels, test, n_neighbors=1, representation='modes', niter=200, k=16, m=4, regul=0.1, s=0.3, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234, predict_mode='bayes', prior='estimated')[source]

Run GTC (GTM classification): advanced Bayes

Parameters:

train (array of shape (n_train, n_dimensions)) – Train set data matrix.
labels (array of shape (n_train, 1)) – Labels for train set.
test (array of shape (n_test, n_dimensions)) – Test set data matrix.
k (int, optional (default = 16)) – If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. k is the sqrt of the number of GTM nodes. One of four GTM hyperparameters (k, m, s, regul). Ex: k = 25 means the GTM will be discretized into a 25x25 grid.
m (int, optional (default = 4)) – If m is set to 0, m is computed as sqrt(k) (generally a good rule of thumb). m is the sqrt of the number of RBF centers. One of four GTM hyperparameters (k, m, s, regul). Ex: m = 5 means the RBF functions will be arranged on a 5x5 grid.
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient. Impacts manifold flexibility.
n_neighbors (int, optional (default = 1)) – Number of neighbors for kNN algorithm (number of nearest nodes). At the moment, n_neighbors is always equal to 1.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
representation ({“modes”, “means”}) – 2D GTM representation for the test set, used for kNN algorithms: “modes” for position with max. responsibility, “means” for average position (usual GTM representation)
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str, optional (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int, optional (default = 1234)) – Random state.
predict_mode ({“bayes”}, optional) – At the moment, only the GTM Bayes classifier is available; (“bayes”, output of advancedPredictBayes()).
prior ({“estimated”, “equiprobable”}, optional) – Type of prior used to build GTM class map (classMap()). Choose “estimated” to account for class imbalance.

Returns:

The output is a dictionary defined as follows:

output[“optimizedModel”]: original training set GTM model, instance of OptimizedGTM

output[“indiv_projections”]: test set GTM model, instance of OptimizedGTM

output[“indiv_probabilities”]: class probabilities for each individual (= dot product between test responsibility matrix and posterior class probabilities)

output[“indiv_predictions”]: class prediction for each individual (argmax of output[“indiv_probabilities”])

output[“group_projections”]: average responsibility vector for the entire test set

output[“group_probabilities”]: posterior class probabilities for the entire test set (dot product between output[“group_projections”] and posterior class probabilities)

output[“uniqClasses”]: classes

Return type:

a dict

Notes

The GTM nearest node classifier (predict_mode = “knn”, predictNN()):

A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has a class label

The test set is projected onto the GTM map

A 2D GTM representation is chosen for the test set (representation = modes or means)

Nearest node on the GTM map is found for each test set individual

The predicted label for each individual is the label of its nearest node on the GTM map

The GTM Bayes classifier (predict_mode = “bayes”, predictBayes()):

A GTM class map (GTM colored by class) is built using the training set (classMap()); the class map is discretized into nodes, and each node has posterior class probabilities

The test set is projected onto the GTM map

The GTM representation for each individual is its responsibility vector (posterior probability distribution on the map)

The probabilities of belonging to each class for a specific individual are computed as an average of posterior class probabilities (array of shape (n_nodes, n_classes)), weighted by the individual’s responsibilities on the GTM map (array of shape (1, n_nodes))

ugtm.ugtm_predictions.advancedPredictBayes(optimizedModel, labels, new_data, prior='estimated')[source]

Bayesian GTM classifier: complete model

Parameters:

optimizedModel (instance of OptimizedGTM) – Optimized GTM model built using a training set of shape (n_individuals, n_dimensions)
labels (array of shape (n_individuals, 1)) – Labels (discrete or continuous) associated with training set
new_data (array of shape (n_test, n_dimensions)) – New data matrix (test set).
prior ({‘estimated’, ‘equiprobable’}, optional) – Only used for classification. Sets priors (Bayes’ theorem) in classMap().

Returns:

The output is a dictionary defined as follows:

output[“optimizedModel”]: original training set GTM model, instance of OptimizedGTM

output[“indiv_projections”]: test set GTM model, instance of OptimizedGTM

output[“indiv_probabilities”]: class probabilities for each individual (= dot product between test responsibility matrix and posterior class probabilities)

output[“indiv_predictions”]: class prediction for each individual (argmax of output[“indiv_probabilities”])

output[“group_projections”]: average responsibility vector for the entire test set

output[“group_probabilities”]: posterior class probabilities for the entire test set (dot product between output[“group_projections”] and posterior class probabilities)

output[“uniqClasses”]: classes

Return type:

a dict

Notes

This function computes GTM class predictions by using posterior probabilities of classes weighted by responsibilities.

Generate GTM class map (classMap())

Project new data (projection()) on optimized GTM model (OptimizedGTM)

Projected data responsibilities R are used as weights to find outcome $C_{max}$ for each tested instance: $C_{max} = \operatorname*{arg\,max}_C \sum_k{R_{ki} P(C|k)}$

The algorithm is the same as in predictBayes(), but this function returns a complete output including original training set optimized GTM model, test set GTM model, individual class probabilities for each individual, class prediction for each individual, group projections (average position of the whole test set on the map), class probabilities for the whole test set, and classes used to build the classification model.
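
The group-level outputs described above (average responsibility vector and class probabilities for the whole test set) can be sketched as follows. This is a NumPy illustration of the description, not the body of advancedPredictBayes(); the toy matrices and the (n_test, n_nodes) orientation are assumptions.

```python
import numpy as np

# Posterior class probabilities per node, shape (n_nodes, n_classes),
# as a class map would provide them (toy values).
nodeClassP = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.1, 0.9]])

# Responsibilities of the projected test set, shape (n_test, n_nodes).
matR_test = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.2, 0.7]])

# Average position of the whole test set on the map.
group_projections = matR_test.mean(axis=0)

# Class probabilities for the whole test set: dot product between the
# average responsibility vector and the posterior class probabilities.
group_probabilities = group_projections @ nodeClassP
```

Because every row of matR_test and of nodeClassP sums to 1, the group probabilities also sum to 1.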

ugtm.ugtm_predictions.predictBayes(optimizedModel, labels, new_data, prior='estimated')[source]

Bayesian GTM classifier (GTC Bayes).

Parameters:

optimizedModel (instance of OptimizedGTM) – Optimized GTM model built using a training set of shape (n_individuals, n_dimensions)
labels (array of shape (n_individuals, 1)) – Labels (discrete or continuous) associated with training set
new_data (array of shape (n_test, n_dimensions)) – New data matrix (test set).
prior ({‘estimated’, ‘equiprobable’}, optional) – Only used for classification. Sets priors (Bayes’ theorem) in classMap().

Returns:

Predicted outcome.

Return type:

array of shape (n_test, 1)

Notes

This function computes GTM class predictions by using posterior probabilities of classes weighted by responsibilities. This is similar to a maximum a posteriori (MAP) estimator.

Generate GTM class map (classMap())

Project new data (projection()) on optimized GTM model (OptimizedGTM)

Projected data responsibilities R are used as weights to find outcome $C_{max}$ for each tested instance: $C_{max} = \operatorname*{arg\,max}_C \sum_k{R_{ki} P(C|k)}$
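
The argmax rule above reduces to a matrix product followed by a row-wise argmax. The sketch below is a NumPy illustration of that rule, not the body of predictBayes(); the toy matrices, class names and the (n_test, n_nodes) orientation are assumptions.

```python
import numpy as np

uniqClasses = np.array(["a", "b"])

# Posterior class probabilities P(C|k), shape (n_nodes, n_classes) (toy values).
nodeClassP = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.1, 0.9]])

# Responsibilities of the projected test set, shape (n_test, n_nodes).
matR_test = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.2, 0.7]])

# sum_k R_ki P(C|k) for every individual i and class C.
indiv_probabilities = matR_test @ nodeClassP

# C_max: the class with the highest weighted posterior probability.
indiv_predictions = uniqClasses[np.argmax(indiv_probabilities, axis=1)]
```

Here the first individual sits mostly on the first node and is assigned to class "a"; the second sits mostly on the third node and is assigned to class "b".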

ugtm.ugtm_predictions.predictNN(optimizedModel, labels, new_data, modeltype='regression', n_neighbors=1, representation='modes', prior='estimated')[source]

GTM nearest node(s) classification or regression.

Parameters:

optimizedModel (instance of OptimizedGTM) – Optimized GTM model built using a training set of shape (n_individuals, n_dimensions)
labels (array of shape (n_individuals, 1)) – Labels (discrete or continuous) associated with training set
new_data (array of shape (n_test, n_dimensions)) – New data matrix (test set).
modeltype ({‘classification’, ‘regression’}, optional) – Choice between classification and regression.
n_neighbors (int, optional (default = 1)) – Number of nodes to take into account in kNN algorithm. NB: for classification, n_neighbors is always equal to 1.
representation ({‘modes’, ‘means’}, optional) – Defines GTM representation type: mean or mode of responsibilities.
prior ({‘estimated’, ‘equiprobable’}, optional) – Only used for classification. Sets priors (Bayes’ theorem) in classMap().

Returns:

Predicted outcome.

Return type:

array of shape (n_test, 1)

Notes

This function implements classification or regression based on nearest GTM nodes:

If (modeltype == ‘classification’), generate GTM class map (classMap()); if (modeltype == ‘regression’), generate GTM landscape (landscape())

Project new data (projection()) on optimized GTM model (OptimizedGTM)

Depending on provided parameters, choose means or modes as GTM coordinates for the new data

Find the nodes closest to the new data GTM coordinates (sklearn function kneighbors)

Retrieve predicted outcomes corresponding to nodes on class map (classification task) or landscape (regression task)

If (modeltype == ‘classification’), the predicted outcome is the outcome of the nearest node on the class map; if (modeltype == ‘regression’), the predicted outcome is the average outcome of the k nearest nodes (k = n_neighbors), weighted by inverse squared distances (weights=1/((dist)**2))
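
The regression branch can be sketched as a weighted average over the nearest nodes. The documentation states that ugtm finds neighbors with sklearn's kneighbors; the sketch below uses plain NumPy instead, and the toy grid, landscape values, test coordinates and the guard against zero distances are all assumptions.

```python
import numpy as np

# Node coordinates of a 3x3 grid (matX) and one activity value per node.
matX = np.array([[x, y] for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)])
node_activity = np.arange(9, dtype=float)

# 2D GTM coordinates (means or modes) of two projected test individuals.
coords = np.array([[0.1, 0.0],
                   [0.9, 0.8]])
n_neighbors = 2

# Distances from every test individual to every node.
dist = np.linalg.norm(coords[:, None, :] - matX[None, :, :], axis=2)

# Indices and distances of the n_neighbors closest nodes.
nearest = np.argsort(dist, axis=1)[:, :n_neighbors]
near_dist = np.take_along_axis(dist, nearest, axis=1)

# Inverse squared distance weights; the lower bound avoids division by zero
# when a point falls exactly on a node (an assumption of this sketch).
weights = 1.0 / np.maximum(near_dist, 1e-12) ** 2

# Weighted average of the activities of the nearest nodes.
predicted = (weights * node_activity[nearest]).sum(axis=1) / weights.sum(axis=1)
```

With n_neighbors = 1 the weighted average collapses to the value of the single nearest node, which matches the classification behavior described above.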

ugtm.ugtm_predictions.predictNNSimple(train, test, labels, n_neighbors=1, modeltype='regression')[source]

Nearest neighbor(s) classification or regression.

Parameters:

train (array of shape (n_train, n_dimensions)) – Train set data matrix.
test (array of shape (n_test, n_dimensions)) – Test set data matrix.
labels (array of shape (n_train, 1)) – Labels (discrete or continuous) for the training set.
n_neighbors (int, optional (default = 1)) – Number of nodes to take into account in kNN algorithm.
modeltype ({‘classification’, ‘regression’}, optional) – Choice between classification and regression.

Returns:

Predicted outcome.

Return type:

array of shape (n_test, 1)

Notes

This function implements classification or regression based on classical kNN algorithm.

ugtm.ugtm_predictions.printClassPredictions(prediction, output)[source]

Print output of advancedPredictBayes().

Parameters:

prediction (dict) – Output of advancedPredictBayes(), with the following keys: “optimizedModel”: OptimizedGTM, “indiv_projections”: OptimizedGTM, “indiv_probabilities”: array of shape (n_individuals, n_classes), “indiv_predictions”: array of shape (n_individuals, 1), “group_projections”: array of shape (n_nodes, 1), “group_probabilities”: array of shape (n_probabilities, 1), “uniqClasses”: array of shape (n_classes, 1)
output (str) – Output path to write class prediction model (prediction dictionary).

Returns:

output_indiv_probabilities.csv
output_indiv_predictions.csv
output_group_probabilities.csv

Return type:

CSV files

ugtm.ugtm_preprocess module

Preprocessing operations (mostly using scikit-learn functions).

class ugtm.ugtm_preprocess.ProcessedTrainTest(train, test)[source]

Bases: object

Class for processed train and test set.

Parameters:

train (array of shape (n_train, n_dimensions)) – Train data matrix.
test (array of shape (n_test, n_dimensions)) – Test data matrix.

ugtm.ugtm_preprocess.chooseKernel(data, kerneltype='euclidean')[source]

Kernelize data (uses sklearn).

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
kerneltype ({‘euclidean’, ‘cosine’, ‘laplacian’, ‘polynomial_kernel’, ‘jaccard’}, optional) – Kernel type.

Return type:

array of shape (n_individuals, n_individuals)

ugtm.ugtm_preprocess.pcaPreprocess(data, doPCA=False, n_components=-1, missing=False, missing_strategy='median', random_state=1234)[source]

Preprocess data using PCA.

Parameters:

data (array of shape (n_individuals, n_dimensions)) – Data matrix.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int (default = 1234)) – Random state.

Returns:

Data projected onto principal axes.

Return type:

array of shape (n_individuals, n_components)
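
The n_components = -1 rule (keep the principal components accounting for 80% of data variance) can be sketched with an SVD. This is a NumPy illustration of the selection rule only, not the scikit-learn code used by pcaPreprocess(); the toy data and the choice to center without scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with very different variances per column (assumed values).
data = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# Center the data and compute its principal axes.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of variance explained by each component, then its running total.
explained = S ** 2 / np.sum(S ** 2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative variance reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

# Data projected onto the retained principal axes.
projected = centered @ Vt[:n_components].T
```

The result has shape (n_individuals, n_components), matching the return type documented above.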

ugtm.ugtm_preprocess.processTrainTest(train, test, doPCA, n_components, missing=False, missing_strategy='median', random_state=1234)[source]

Preprocess train and test data using PCA.

Parameters:

train (array of shape (n_train, n_dimensions)) – Train data matrix.
test (array of shape (n_test, n_dimensions)) – Test data matrix.
doPCA (bool, optional (default = False)) – Apply PCA pre-processing.
n_components (int, optional (default = -1)) – Number of components for PCA pre-processing. If set to -1, keep principal components accounting for 80% of data variance.
missing (bool, optional (default = False)) – Replace missing values (calls scikit-learn functions).
missing_strategy (str (default = ‘median’)) – Scikit-learn missing data strategy.
random_state (int (default = 1234)) – Random state.

Return type:

instance of ProcessedTrainTest

ugtm.ugtm_sklearn module

GTM transformer, classifier and regressor compatible with sklearn

class ugtm.ugtm_sklearn.eGTC(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, prior='estimated')[source]

Bases: BaseEstimator, ClassifierMixin

eGTC : GTC Bayesian classifier for sklearn pipelines.

Parameters:

k (int, optional (default = 16)) – Square root of the number of GTM nodes; e.g., k = 25 means the GTM is discretized into a 25x25 grid. If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. One of four GTM hyperparameters (k, m, s, regul).
m (int, optional (default = 4)) – Square root of the number of RBF centers; e.g., m = 5 means the RBF functions are arranged on a 5x5 grid. If m is set to 0, m is computed as sqrt(k). One of four GTM hyperparameters (k, m, s, regul).
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
prior ({‘estimated’, ‘equiprobable’}) – Type of prior for class map. Use ‘estimated’ to account for class imbalance.
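
As a rough illustration of the automatic sizing rules for k and m, the sketch below evaluates the two documented formulas. Truncating the results to integers is an assumption of this sketch, and the helper name is hypothetical.

```python
import math


def auto_grid_sizes(n_individuals, k=0, m=0):
    """Resolve k and m when they are passed as 0.

    Hypothetical helper for illustration only; not a ugtm function.
    """
    if k == 0:
        # Documented rule: sqrt(5*sqrt(n_individuals)) + 2.
        # Truncation to an integer is an assumption of this sketch.
        k = int(math.sqrt(5 * math.sqrt(n_individuals))) + 2
    if m == 0:
        # Documented rule: sqrt(k), truncated here as well (assumption).
        m = int(math.sqrt(k))
    return k, m


k, m = auto_grid_sizes(n_individuals=10_000)
n_nodes = k * k          # the map is a k x k grid of nodes
n_rbf_centers = m * m    # the RBF centers form an m x m grid
print(k, m, n_nodes, n_rbf_centers)
```

For 10,000 individuals this gives k = 24 and m = 4 under the truncation assumption, i.e. 576 nodes and 16 RBF centers.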

fit(X, y)[source]

Constructs activity model f(X,y) using classMap().

Parameters:

X (array of shape (n_instances, n_dimensions)) – Data matrix.
y (array of shape (n_instances,)) – Data labels.
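
The two values of prior can be pictured as follows. This sketch assumes that 'estimated' means class frequencies observed in the training labels and that 'equiprobable' means a uniform prior over classes; how ugtm combines the prior with the class map is not shown here, and the helper name is hypothetical.

```python
from collections import Counter


def class_priors(labels, prior="estimated"):
    """Return a {class: prior probability} mapping.

    Hypothetical helper for illustration only; not a ugtm function.
    """
    counts = Counter(labels)
    classes = sorted(counts)
    if prior == "estimated":
        # Prior proportional to class frequency in the training labels.
        return {c: counts[c] / len(labels) for c in classes}
    if prior == "equiprobable":
        # Same prior for every class, regardless of frequency.
        return {c: 1.0 / len(classes) for c in classes}
    raise ValueError("prior must be 'estimated' or 'equiprobable'")


y = [0, 0, 0, 1]  # imbalanced toy labels
estimated = class_priors(y, "estimated")
equiprobable = class_priors(y, "equiprobable")
print(estimated, equiprobable)
```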

predict(X)[source]

Predicts new labels for X using projection().

Parameters: X (array of shape (n_instances, n_dimensions)) – Data matrix.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → eGTC

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

class ugtm.ugtm_sklearn.eGTCnn(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, prior='estimated', representation='modes')[source]

Bases: BaseEstimator, RegressorMixin

eGTCnn: GTC nearest node classifier for sklearn pipelines.

Parameters:

k (int, optional (default = 16)) – Square root of the number of GTM nodes; e.g., k = 25 means the GTM is discretized into a 25x25 grid. If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. One of four GTM hyperparameters (k, m, s, regul).
m (int, optional (default = 4)) – Square root of the number of RBF centers; e.g., m = 5 means the RBF functions are arranged on a 5x5 grid. If m is set to 0, m is computed as sqrt(k). One of four GTM hyperparameters (k, m, s, regul).
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
prior ({‘estimated’, ‘equiprobable’}) – Type of prior for class map. Use ‘estimated’ to account for class imbalance.
representation ({‘modes’, ‘means’}, optional) – Type of 2D representation used in kNN algorithm.

fit(X, y)[source]

Constructs activity model f(X,y) using classMap().

Parameters:

X (array of shape (n_instances, n_dimensions)) – Data matrix.
y (array of shape (n_instances,)) – Data labels.

predict(X)[source]

Predicts new labels for X using projection().

Parameters: X (array of shape (n_instances, n_dimensions)) – Data matrix.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → eGTCnn

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

class ugtm.ugtm_sklearn.eGTM(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, model='means')[source]

Bases: BaseEstimator, TransformerMixin

eGTM: GTM Transformer for sklearn pipeline.

Parameters:

k (int, optional (default = 16)) – Square root of the number of GTM nodes; e.g., k = 25 means the GTM is discretized into a 25x25 grid. If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. One of four GTM hyperparameters (k, m, s, regul).
m (int, optional (default = 4)) – Square root of the number of RBF centers; e.g., m = 5 means the RBF functions are arranged on a 5x5 grid. If m is set to 0, m is computed as sqrt(k). One of four GTM hyperparameters (k, m, s, regul).
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
model ({‘means’, ‘modes’, ‘responsibilities’, ‘complete’}, optional) – GTM data representations: ‘means’ for mean data positions, ‘modes’ for positions with max. responsibilities, ‘responsibilities’ for probability distribution on the map, ‘complete’ for a complete instance of OptimizedGTM.

fit(X, y=None)[source]

Fits GTM to X using OptimizedGTM.

Parameters: X (2D array) – Data matrix.

fit_transform(X, y=None)[source]

Fits and transforms X using GTM.

Parameters:

X (2D array) – Data matrix.

Returns:

if self.model='means', array of shape (n_instances, 2),
if self.model='modes', array of shape (n_instances, 2),
if self.model='responsibilities', array of shape (n_instances, n_nodes),
if self.model='complete', instance of class OptimizedGTM
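
The relationship between the 'means', 'modes' and 'responsibilities' representations can be sketched with NumPy. This is a minimal illustration under the usual GTM definitions (mean position = responsibility-weighted average of node coordinates, mode = coordinates of the node with the highest responsibility); the grid and responsibility values are made up, and the code is not taken from ugtm.

```python
import numpy as np

# Hypothetical 2x2 grid of node coordinates in the 2D latent space,
# playing the role of matX with shape (n_nodes, 2).
matX = np.array([[-1.0, -1.0],
                 [-1.0,  1.0],
                 [ 1.0, -1.0],
                 [ 1.0,  1.0]])

# Made-up responsibilities for two instances, shape (n_instances, n_nodes).
# Each row is a probability distribution over the nodes.
matR = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.0, 0.4, 0.0, 0.6]])

# 'means': responsibility-weighted average of node coordinates.
means = matR @ matX                       # shape (n_instances, 2)

# 'modes': coordinates of the node with the highest responsibility.
modes = matX[np.argmax(matR, axis=1)]     # shape (n_instances, 2)

print(means)
print(modes)
```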

inverse_transform(matR)[source]

Maps responsibilities back onto the original data space.

Parameters: matR (array of shape (n_samples, n_nodes))
Returns: matY
Return type: array of shape (n_samples, n_dimensions)
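
The shapes above are consistent with a responsibility-weighted average of the Gaussian centers stored in matY, which the class documentation describes as an array of shape (n_dimensions, n_nodes). The sketch below shows that reading with made-up values; it is an assumption about what the method computes, not a copy of the ugtm implementation.

```python
import numpy as np

# Made-up Gaussian centers in data space, shape (n_dimensions, n_nodes).
# Column i is the center associated with node i.
matY = np.array([[0.0, 2.0, 4.0],
                 [1.0, 1.0, 3.0]])

# Made-up responsibilities, shape (n_samples, n_nodes); rows sum to 1.
matR = np.array([[1.0, 0.0,  0.0],
                 [0.5, 0.5,  0.0],
                 [0.0, 0.25, 0.75]])

# Assumed inverse mapping: each sample becomes the responsibility-weighted
# average of the centers, giving shape (n_samples, n_dimensions).
reconstructed = matR @ matY.T
print(reconstructed)
```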

set_inverse_transform_request(*, matR: bool | None | str = '$UNCHANGED$') → eGTM

Configure whether metadata should be requested to be passed to the inverse_transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: matR (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for matR parameter in inverse_transform.
Returns: self – The updated object.
Return type: object

transform(X)[source]

Projects new data X onto GTM using projection().

Parameters:

X (2D array) – Data matrix.

Returns:

if self.model='means', array of shape (n_instances, 2),
if self.model='modes', array of shape (n_instances, 2),
if self.model='responsibilities', array of shape (n_instances, n_nodes),
if self.model='complete', instance of class OptimizedGTM

class ugtm.ugtm_sklearn.eGTR(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, n_neighbors=2, representation='modes')[source]

Bases: BaseEstimator, RegressorMixin

eGTR: GTM nearest node(s) regressor for sklearn pipelines.

Parameters:

k (int, optional (default = 16)) – Square root of the number of GTM nodes; e.g., k = 25 means the GTM is discretized into a 25x25 grid. If k is set to 0, k is computed as sqrt(5*sqrt(n_individuals))+2. One of four GTM hyperparameters (k, m, s, regul).
m (int, optional (default = 4)) – Square root of the number of RBF centers; e.g., m = 5 means the RBF functions are arranged on a 5x5 grid. If m is set to 0, m is computed as sqrt(k). One of four GTM hyperparameters (k, m, s, regul).
s (float, optional (default = 0.3)) – RBF width factor. One of four GTM hyperparameters (k, m, s, regul). Parameter to tune width of RBF functions. Impacts manifold flexibility.
regul (float, optional (default = 0.1)) – One of four GTM hyperparameters (k, m, s, regul). Regularization coefficient.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Number of iterations for EM algorithm.
verbose (bool, optional (default = False)) – Verbose mode (outputs loglikelihood values during EM algorithm).
n_neighbors (int, optional (default = 2)) – Number of neighbors for kNN algorithm.
representation ({‘modes’, ‘means’}, optional) – Type of 2D representation used in kNN algorithm.

fit(X, y)[source]

Constructs activity model f(X,y) using landscape().

Parameters:

X (array of shape (n_instances, n_dimensions)) – Data matrix.
y (array of shape (n_instances,)) – Data labels.

predict(X)[source]

Predicts new labels for X using projection().

Parameters: X (array of shape (n_instances, n_dimensions)) – Data matrix.
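
The role of n_neighbors and of the 2D representation can be pictured with a plain k-nearest-neighbors regressor working on map coordinates. The sketch assumes an unweighted mean over the nearest training points; ugtm may weight neighbors differently, and the function name and data are hypothetical.

```python
import math


def knn_regress(train_xy, train_y, query_xy, n_neighbors=2):
    """Predict a value as the mean target of the nearest training points
    in the 2D map.

    Hypothetical helper for illustration only; not a ugtm function.
    """
    # Rank training points by Euclidean distance to the query position.
    order = sorted(range(len(train_xy)),
                   key=lambda i: math.dist(train_xy[i], query_xy))
    nearest = order[:n_neighbors]
    # Unweighted average of the neighbors' target values (assumption).
    return sum(train_y[i] for i in nearest) / len(nearest)


# Made-up 2D positions (e.g. 'modes' or 'means') and target values.
train_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(train_xy, train_y, (0.9, 0.2), n_neighbors=2)
print(prediction)
```

Here the two nearest training points are (1.0, 0.0) and (1.0, 1.0), so the prediction is the mean of 2.0 and 10.0.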

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → eGTR

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

class ugtm.ugtm_sklearn.eIGTM(k=16, m=4, s=0.3, regul=0.1, random_state=1234, niter=200, verbose=False, model='means', n_blocks=0)[source]

Bases: BaseEstimator, TransformerMixin

eIGTM: incremental GTM Transformer for sklearn pipelines.

Fits a GTM model using block-wise EM (Gaspar et al. 2014), suitable for large datasets where the full N×K responsibility matrix does not fit in memory. The full matrix is never formed; only two (n_nodes,)-shaped accumulators are kept per iteration.

Parameters:

k (int, optional (default = 16)) – Sqrt of the number of GTM nodes (0 = auto).
m (int, optional (default = 4)) – Sqrt of the number of RBF centers (0 = auto).
s (float, optional (default = 0.3)) – RBF width factor.
regul (float, optional (default = 0.1)) – Regularization coefficient.
random_state (int (default = 1234)) – Random state.
niter (int, optional (default = 200)) – Maximum EM iterations.
verbose (bool, optional (default = False)) – Verbose mode.
model ({‘means’, ‘modes’, ‘responsibilities’, ‘complete’}, optional) – Output representation returned by transform().
n_blocks (int, optional (default = 0)) – Number of data blocks. 0 = auto (ceil(N / 5000)).
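
The automatic choice of n_blocks can be worked through with a short sketch. The ceil(N / 5000) rule comes from the parameter description; splitting the rows into contiguous, near-equal slices is an assumption of this sketch, and both helper names are hypothetical.

```python
import math


def resolve_n_blocks(n_samples, n_blocks=0, rows_per_block=5000):
    """Apply the documented rule: 0 means ceil(N / 5000) blocks.

    Hypothetical helper for illustration only; not a ugtm function.
    """
    if n_blocks == 0:
        n_blocks = math.ceil(n_samples / rows_per_block)
    return n_blocks


def block_bounds(n_samples, n_blocks):
    """Split range(n_samples) into contiguous, near-equal (start, stop)
    slices. The splitting scheme is an assumption of this sketch."""
    base, extra = divmod(n_samples, n_blocks)
    bounds, start = [], 0
    for i in range(n_blocks):
        # The first `extra` blocks take one additional row.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


n_blocks = resolve_n_blocks(12_001)
bounds = block_bounds(12_001, n_blocks)
print(n_blocks, bounds)
```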

fit(X, y=None)[source]

Fits iGTM to X using block-wise EM.

Parameters: X (2D array) – Data matrix.

fit_transform(X, y=None)[source]

Fits iGTM to X and returns the training-set representation.

For model='means' and model='modes' the values computed during the final block pass of fit() are returned directly, avoiding an extra projection pass.

Parameters: X (2D array) – Data matrix.
Return type: See transform().

inverse_transform(matR)[source]

Maps responsibility vectors back to the original data space.

Parameters: matR (array of shape (n_samples, n_nodes))
Return type: array of shape (n_samples, n_dimensions)

set_inverse_transform_request(*, matR: bool | None | str = '$UNCHANGED$') → eIGTM

Configure whether metadata should be requested to be passed to the inverse_transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters: matR (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for matR parameter in inverse_transform.
Returns: self – The updated object.
Return type: object

transform(X)[source]

Projects X onto the fitted iGTM using a single E-step pass.

Parameters:

X (2D array) – Data matrix.

Returns:

if self.model='means', array of shape (n_instances, 2),
if self.model='modes', array of shape (n_instances, 2),
if self.model='responsibilities', array of shape (n_instances, n_nodes),
if self.model='complete', instance of OptimizedGTM

transform_blocks(X, block_size=5000)[source]

Project X onto the fitted iGTM block-by-block (generator).

Yields one block’s result at a time so peak memory is proportional to block_size × n_nodes rather than N × n_nodes. Useful when X is large or when model='responsibilities' and the full N×K matrix would not fit in RAM.

Parameters:

X (2D array) – Data matrix.
block_size (int, optional (default = 5000)) – Number of rows per yielded block.

Yields:

Same type as transform(), but for one block of rows at a time:

if self.model='means' or self.model='modes', array of shape (block_size, 2) (last block may be smaller),
if self.model='responsibilities', array of shape (block_size, n_nodes),
if self.model='complete', instance of OptimizedGTM
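
The generator pattern described for transform_blocks can be sketched without the library. The example below yields fixed-size row blocks and processes them one at a time, so only one block's result is held in memory. The names iter_blocks and project_block are hypothetical, and project_block is a stand-in for the real per-block projection.

```python
def iter_blocks(rows, block_size=5000):
    """Yield consecutive slices of `rows`; the last one may be shorter.

    Hypothetical helper for illustration only; not a ugtm function.
    """
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def project_block(block):
    # Stand-in for the per-block projection: it simply doubles each value.
    return [2 * value for value in block]


rows = list(range(12))
block_sizes = []
total = 0
for block in iter_blocks(rows, block_size=5):
    result = project_block(block)      # only this block's result is alive
    block_sizes.append(len(result))
    total += sum(result)               # reduce, then let the block go

print(block_sizes, total)
```

Consuming the generator in a loop and reducing or writing out each block's result is what keeps peak memory bounded by the block size.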

Module contents

ugtm: a Python package for Generative Topographic Mapping (GTM)

Submodules

`ugtm_sklearn`	GTM transformer, classifier and regressor compatible with sklearn
`ugtm_gtm`	Functions to run GTM models.
`ugtm_kgtm`	Functions to initialize and optimize kernel GTM models.
`ugtm_classes`	Defines classes for initial and optimized GTM model.
`ugtm_landscape`	Builds continuous GTM class maps or landscapes using labels or activities.
`ugtm_predictions`	GTC (GTM classification) and GTR (GTM regression)
`ugtm_crossvalidate`	Cross-validation support for GTC and GTR models (also SVM and PCA).
`ugtm_preprocess`	Preprocessing operations (mostly using scikit-learn functions).