eIGTM: incremental GTM
Overview
eIGTM is a memory-efficient, sklearn-compatible
generative topographic mapping (GTM) transformer for large datasets (Gaspar et al. 2014).
Standard GTM holds the full N×K responsibility matrix in RAM at every EM
iteration. iGTM processes the data in blocks, accumulating only two small
sufficient-statistic arrays per iteration:
g_vec — shape (n_nodes,): accumulated row sums of R across blocks
RT_acc — shape (n_nodes, n_dimensions): accumulated R @ X across blocks
Peak memory per iteration is therefore O(block_size × n_nodes) rather than O(N × n_nodes). The W update is mathematically equivalent to standard GTM. Only the β⁻¹ update differs: it uses the block-local distances rather than recomputed distances. This is a minor approximation with negligible effect on the final manifold (Pearson r > 0.999 between GTM and iGTM coordinates in practice).
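The accumulation described above can be illustrated with a small NumPy sketch. It is not the ugtm implementation: the helper names `responsibilities` and `accumulate_blocks`, the fixed `beta`, and the random node centres `Y` are assumptions made for illustration. It only shows that summing per-block statistics gives the same g_vec and RT_acc as building the full responsibility matrix.

```python
import numpy as np

def responsibilities(X, Y, beta):
    """Return R with shape (n_nodes, n_samples); each column sums to 1.

    Hypothetical helper: isotropic Gaussian posteriors over node centres Y.
    """
    # Squared Euclidean distances between every node and every sample
    d = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    log_p = -0.5 * beta * d
    log_p -= log_p.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=0, keepdims=True)

def accumulate_blocks(X, Y, beta, n_blocks):
    """Accumulate the two sufficient statistics one block at a time."""
    n_nodes, n_dim = Y.shape
    g_vec = np.zeros(n_nodes)            # row sums of R
    RT_acc = np.zeros((n_nodes, n_dim))  # R @ X
    for X_block in np.array_split(X, n_blocks):
        R_block = responsibilities(X_block, Y, beta)  # (n_nodes, block_size)
        g_vec += R_block.sum(axis=1)
        RT_acc += R_block @ X_block
    return g_vec, RT_acc

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))   # data
Y = rng.standard_normal((16, 5))     # node centres in data space (assumed)

g_vec, RT_acc = accumulate_blocks(X, Y, beta=1.0, n_blocks=4)

# Reference: the full N×K matrix that the block-wise version avoids
R_full = responsibilities(X, Y, beta=1.0)
print(np.allclose(g_vec, R_full.sum(axis=1)), np.allclose(RT_acc, R_full @ X))
```

Only one (n_nodes, block_size) array is alive at a time inside the loop, which is where the memory saving comes from.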
Run eIGTM
The API mirrors eGTM, with one extra parameter,
n_blocks (the default, 0, sets it automatically to ceil(N / 5000)):
from ugtm import eIGTM
import numpy as np
X_train = np.random.randn(10000, 50)
X_test = np.random.randn(1000, 50)
# Fit iGTM on X_train; blocks chosen automatically
model = eIGTM().fit(X_train)
# 2D projection of X_test
transformed = model.transform(X_test)
# Or fit and transform in one call
transformed_train = eIGTM().fit_transform(X_train)
For the low-level wrapper (mirrors runGTM()):
from ugtm import runIGTM
import numpy as np
data = np.random.randn(10000, 50)
model = runIGTM(data, n_blocks=5)
# Access 2D coordinates
coordinates = model.matMeans
modes = model.matModes
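As a rough illustration of what mean and mode coordinates represent in a GTM, the sketch below derives both from a responsibility matrix. The square node grid on [-1, 1]², the random responsibilities, and the variable names are assumptions for illustration. The actual latent layout used by ugtm may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 4                                  # grid side, so n_nodes = k * k (assumed layout)
axis = np.linspace(-1.0, 1.0, k)
node_xy = np.array([(x, y) for x in axis for y in axis])  # (n_nodes, 2)

# Stand-in responsibilities: 5 samples, each row sums to 1
R = rng.random((5, k * k))
R /= R.sum(axis=1, keepdims=True)

# Mean position: responsibility-weighted average of the node coordinates
means = R @ node_xy
# Mode position: coordinates of the single most responsible node
modes = node_xy[R.argmax(axis=1)]

print(means.shape, modes.shape)
```

Means vary continuously inside the grid, whereas modes always fall exactly on a node.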
Block-wise projection for large test sets
When the test set is also large, use the
transform_blocks() generator so that peak
memory stays bounded:
from ugtm import eIGTM
import numpy as np
X_train = np.random.randn(10000, 50)
X_test = np.random.randn(10000, 50)
model = eIGTM().fit(X_train)
# Yields one (block_size, 2) array at a time
for block_coords in model.transform_blocks(X_test, block_size=1000):
    # process or write block_coords here
    pass
# To collect all at once (only if result fits in RAM):
coords = np.vstack(list(model.transform_blocks(X_test, block_size=1000)))
For model='responsibilities', transform_blocks yields
(block_size, n_nodes) arrays, avoiding the N×K matrix entirely:
model = eIGTM(model='responsibilities').fit(X_train)
for resp_block in model.transform_blocks(X_test, block_size=1000):
    # resp_block.shape == (1000, n_nodes)
    pass
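When the collected result does fit in RAM, writing each block into a preallocated array avoids holding both the list of blocks and the stacked copy at once. The sketch below shows that pattern with a stand-in generator. `fake_transform_blocks` and `collect_blocks` are hypothetical helpers, not part of ugtm, and the stand-in simply returns the first two columns of its input.

```python
import numpy as np

def fake_transform_blocks(X, block_size):
    """Hypothetical stand-in for a block generator: yields (<=block_size, 2) arrays."""
    for start in range(0, X.shape[0], block_size):
        yield X[start:start + block_size, :2]

def collect_blocks(block_iter, n_rows, n_cols=2):
    """Copy blocks from a generator into one preallocated array."""
    out = np.empty((n_rows, n_cols))
    offset = 0
    for block in block_iter:
        out[offset:offset + block.shape[0]] = block
        offset += block.shape[0]
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((2500, 3))

# The last block holds only 500 rows; the offset bookkeeping handles that
coords = collect_blocks(fake_transform_blocks(X, block_size=1000), n_rows=X.shape[0])
print(coords.shape)
```

The same loop works with a `np.memmap` as the output array if the result should live on disk instead.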
Choosing n_blocks
With the default n_blocks=0, blocks are sized to ~5 000 rows each.
For a dataset of N rows:
| N | Auto n_blocks |
|---|---|
| ≤ 5 000 | 1 (identical to standard GTM) |
| 10 000 | 2 |
| 50 000 | 10 |
| 1 000 000 | 200 |
Setting n_blocks=1 reproduces standard GTM behaviour (with the minor
β⁻¹ approximation noted above).
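The automatic rule can be reproduced with a one-line helper. `auto_n_blocks` is a hypothetical name used here for illustration. It applies the ceil(N / 5000) rule stated above, and the floor of one block for empty input is an assumption.

```python
import math

def auto_n_blocks(n_rows, target_block_size=5000):
    """Number of blocks so that each block holds at most target_block_size rows."""
    return max(1, math.ceil(n_rows / target_block_size))

for n in (1_000, 5_000, 10_000, 50_000, 1_000_000):
    print(n, auto_n_blocks(n))
```

To cap the block size at a different value, pass for example `n_blocks=auto_n_blocks(N, 2000)` explicitly when constructing the model.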
Visualize projection
from ugtm import eIGTM
import numpy as np
import altair as alt
import pandas as pd
np.random.seed(0)
X_train = np.random.randn(100, 10)
X_test = np.random.randn(50, 10)
labels = np.random.choice(['A', 'B', 'C'], size=50)
transformed = eIGTM(n_blocks=2).fit(X_train).transform(X_test)
df = pd.DataFrame(transformed, columns=["x1", "x2"])
df["label"] = labels
alt.Chart(df).mark_point(size=60).encode(
    x='x1', y='x2',
    color=alt.Color('label:N', scale=alt.Scale(scheme='set1')),
    tooltip=['x1', 'x2', 'label']
).properties(title="iGTM projection of X_test", width=300, height=300).interactive()
Parameter optimization
eIGTM is sklearn-compatible and works with
GridSearchCV:
from ugtm import eIGTM
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
X_train = np.random.randn(200, 20)
y_train = np.random.choice([0, 1], size=200)
pipe = Pipeline([
    ('igtm', eIGTM()),
    ('knn', KNeighborsClassifier()),
])
param_grid = {
    'igtm__k': [4, 8],
    'igtm__s': [0.3, 1.0],
    'igtm__regul': [0.01, 0.1],
}
gs = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
print(gs.best_params_)