eIGTM: incremental GTM

Overview

eIGTM is a memory-efficient, sklearn-compatible transformer that implements incremental GTM (iGTM) for large datasets (Gaspar et al. 2014). Standard GTM holds the full N×K responsibility matrix (N data points, K nodes) in RAM at every EM iteration. iGTM instead processes the data in blocks and accumulates only two small sufficient-statistic arrays per iteration:

g_vec — shape (n_nodes,): accumulated row sums of R across blocks
RT_acc — shape (n_nodes, n_dimensions): accumulated R @ X across blocks

Peak memory per iteration is therefore O(block_size × n_nodes) rather than O(N × n_nodes). The W update is mathematically equivalent to that of standard GTM. Only the β⁻¹ update differs: it uses the block-local distances rather than recomputed ones. This introduces a minor approximation with negligible effect on the final manifold (Pearson r > 0.999 between GTM and iGTM coordinates in practice).
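
The block-wise accumulation can be illustrated with plain NumPy. This is a minimal sketch, not ugtm's implementation: the responsibilities() helper, the random node centres Y, and the fixed beta are assumptions made for illustration. It checks that summing g_vec and RT_acc block by block gives the same arrays as computing them from the full responsibility matrix.

```python
import numpy as np

def responsibilities(X_block, Y, beta):
    """Return R with shape (n_nodes, block_size); each column sums to 1.

    Hypothetical helper: isotropic Gaussian responsibilities of every
    node centre in Y (n_nodes, n_dimensions) for every row of X_block.
    """
    # Squared distances between nodes and points: (n_nodes, block_size)
    d2 = ((Y[:, None, :] - X_block[None, :, :]) ** 2).sum(axis=2)
    log_p = -0.5 * beta * d2
    log_p -= log_p.max(axis=0, keepdims=True)  # numerical stability
    R = np.exp(log_p)
    return R / R.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 16
X = rng.standard_normal((N, D))   # data
Y = rng.standard_normal((K, D))   # node centres in data space (assumed)
beta = 1.0                        # inverse variance (assumed fixed here)

# Block-wise pass: only one (K, block_size) matrix is alive at a time
g_vec = np.zeros(K)
RT_acc = np.zeros((K, D))
for X_block in np.array_split(X, 4):
    R = responsibilities(X_block, Y, beta)
    g_vec += R.sum(axis=1)   # accumulated row sums of R
    RT_acc += R @ X_block    # accumulated R @ X

# Reference: the same statistics from the full (K, N) matrix
R_full = responsibilities(X, Y, beta)
full_g = R_full.sum(axis=1)
full_RT = R_full @ X

print(np.allclose(g_vec, full_g), np.allclose(RT_acc, full_RT))
```

Because each point's responsibilities depend only on that point and the current node centres, the two statistics are plain sums over points and can be split across blocks without changing the result.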

Run eIGTM

The API mirrors eGTM, with one extra parameter, n_blocks. The default n_blocks=0 selects the block count automatically as ceil(N / 5000):

from ugtm import eIGTM
import numpy as np

X_train = np.random.randn(10000, 50)
X_test  = np.random.randn(1000, 50)

# Fit iGTM on X_train; blocks chosen automatically
model = eIGTM().fit(X_train)

# 2D projection of X_test
transformed = model.transform(X_test)

# Or fit and transform in one call
transformed_train = eIGTM().fit_transform(X_train)

The low-level wrapper runIGTM() mirrors runGTM():

from ugtm import runIGTM
import numpy as np

data = np.random.randn(10000, 50)
model = runIGTM(data, n_blocks=5)

# Access 2D coordinates: mean positions (matMeans) and mode positions (matModes)
coordinates = model.matMeans
modes       = model.matModes

Block-wise projection for large test sets

When the test set is also large, use the transform_blocks() generator so that peak memory stays bounded:

from ugtm import eIGTM
import numpy as np

X_train = np.random.randn(10000, 50)
X_test  = np.random.randn(10000, 50)

model = eIGTM().fit(X_train)

# Yields one (block_size, 2) array at a time
for block_coords in model.transform_blocks(X_test, block_size=1000):
    # process or write block_coords here
    pass

# To collect all at once (only if the result fits in RAM):
coords = np.vstack(list(model.transform_blocks(X_test, block_size=1000)))
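
When the result does not fit in RAM, each block can be written to disk as it arrives. The sketch below is one possible approach, not part of ugtm: write_blocks() is a hypothetical helper, and stand_in_blocks() is a hypothetical replacement for model.transform_blocks() so that the example runs without a fitted model. It relies only on NumPy's open_memmap.

```python
import os
import tempfile
import numpy as np
from numpy.lib.format import open_memmap

def write_blocks(blocks, path, n_rows, n_cols):
    """Stream 2D blocks into a preallocated .npy file; return rows written."""
    out = open_memmap(path, mode="w+", dtype=np.float64, shape=(n_rows, n_cols))
    start = 0
    for block in blocks:
        stop = start + block.shape[0]
        out[start:stop] = block   # only this block is held in memory
        start = stop
    out.flush()
    return start

def stand_in_blocks(X, block_size):
    # Hypothetical stand-in for model.transform_blocks(X, block_size=...):
    # yields (<= block_size, 2) arrays, here simply the first two columns.
    for i in range(0, X.shape[0], block_size):
        yield X[i:i + block_size, :2]

X_test = np.random.default_rng(0).standard_normal((2500, 50))
path = os.path.join(tempfile.mkdtemp(), "coords.npy")

rows_written = write_blocks(stand_in_blocks(X_test, 1000), path, X_test.shape[0], 2)

# Read back lazily; rows are loaded from disk only when accessed
coords = np.load(path, mmap_mode="r")
print(rows_written, coords.shape)
```

With a fitted model, the generator argument would be model.transform_blocks(X_test, block_size=1000) in place of the stand-in.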

With model='responsibilities', transform_blocks() yields (block_size, n_nodes) arrays, so the full N×K responsibility matrix is never materialized:

model = eIGTM(model='responsibilities').fit(X_train)
for resp_block in model.transform_blocks(X_test, block_size=1000):
    # resp_block.shape == (1000, n_nodes)
    pass
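
Responsibility blocks can be reduced on the fly. As a hedged illustration, the sketch below accumulates the total responsibility per node across blocks, giving a node-density estimate without storing all rows. The generator stand_in_resp_blocks() is hypothetical and only mimics the (block_size, n_nodes) shape described above; it is not ugtm output.

```python
import numpy as np

def stand_in_resp_blocks(n_rows, n_nodes, block_size, seed=0):
    # Hypothetical stand-in for transform_blocks() with
    # model='responsibilities': each yielded row sums to 1.
    rng = np.random.default_rng(seed)
    for start in range(0, n_rows, block_size):
        rows = min(block_size, n_rows - start)
        raw = rng.random((rows, n_nodes))
        yield raw / raw.sum(axis=1, keepdims=True)

n_rows, n_nodes = 2500, 16
node_mass = np.zeros(n_nodes)   # summed responsibility per node
rows_seen = 0

for resp_block in stand_in_resp_blocks(n_rows, n_nodes, block_size=1000):
    node_mass += resp_block.sum(axis=0)   # reduce over the rows of this block
    rows_seen += resp_block.shape[0]

node_density = node_mass / rows_seen      # fraction of data mass per node
print(rows_seen, node_density.sum())
```

Only the (n_nodes,) accumulator persists between blocks, so memory use stays independent of the number of test rows.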

Choosing n_blocks

With the default n_blocks=0, the data are split into ceil(N / 5000) blocks, so each block holds at most 5 000 rows. For a dataset of N rows:

N	Auto n_blocks
≤ 5 000	1 (single block, as in standard GTM)
10 000	2
50 000	10
1 000 000	200

Setting n_blocks=1 reproduces standard GTM behaviour (with the minor β⁻¹ approximation noted above).
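
The automatic rule can be reproduced in a few lines. This is a sketch of the stated ceil(N / 5000) formula only: auto_n_blocks() is a hypothetical helper, not a ugtm function, and the lower bound of 1 for empty input is an assumption.

```python
import math

def auto_n_blocks(n_rows, rows_per_block=5000):
    # ceil(N / 5000), with at least one block (lower bound is an assumption)
    return max(1, math.ceil(n_rows / rows_per_block))

for n_rows in (5_000, 10_000, 50_000, 1_000_000):
    print(n_rows, auto_n_blocks(n_rows))
# 5000 1
# 10000 2
# 50000 10
# 1000000 200
```

Note that a dataset just over a multiple of 5 000 rows gets one additional block: auto_n_blocks(5_001) returns 2.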

Visualize projection

from ugtm import eIGTM
import numpy as np
import altair as alt
import pandas as pd

np.random.seed(0)
X_train = np.random.randn(100, 10)
X_test  = np.random.randn(50, 10)
labels  = np.random.choice(['A', 'B', 'C'], size=50)

transformed = eIGTM(n_blocks=2).fit(X_train).transform(X_test)

df = pd.DataFrame(transformed, columns=["x1", "x2"])
df["label"] = labels

alt.Chart(df).mark_point(size=60).encode(
    x='x1', y='x2',
    color=alt.Color('label:N', scale=alt.Scale(scheme='set1')),
    tooltip=['x1', 'x2', 'label']
).properties(title="iGTM projection of X_test", width=300, height=300).interactive()

Parameter optimization

eIGTM is sklearn-compatible and works with GridSearchCV:

from ugtm import eIGTM
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.randn(200, 20)
y_train = np.random.choice([0, 1], size=200)

pipe = Pipeline([
    ('igtm', eIGTM()),
    ('knn',  KNeighborsClassifier()),
])

param_grid = {
    'igtm__k':     [4, 8],
    'igtm__s':     [0.3, 1.0],
    'igtm__regul': [0.01, 0.1],
}

gs = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
gs.fit(X_train, y_train)
print(gs.best_params_)