========================
eIGTM: incremental GTM
========================

Overview
--------

:class:`~ugtm.ugtm_sklearn.eIGTM` is a memory-efficient, sklearn-compatible
GTM transformer for large datasets. It implements incremental GTM (iGTM;
Gaspar et al. 2014).
Standard GTM holds the full N×K responsibility matrix R (N data points,
K = n_nodes) in RAM at every EM iteration. iGTM instead processes the data in
blocks and accumulates only two small sufficient-statistic arrays per
iteration:

* **g_vec**, shape (n_nodes,): row sums of R accumulated across blocks, i.e.
  the total responsibility of each node
* **RT_acc**, shape (n_nodes, n_dimensions): R @ X accumulated across blocks,
  with each block of R laid out as (n_nodes, block_size)

Peak memory per iteration is therefore O(block_size × n_nodes) rather than
O(N × n_nodes). The W update is mathematically equivalent to that of standard
GTM. Only the β⁻¹ update differs: it reuses the distances already computed for
each block instead of recomputing them, a minor approximation with negligible
effect on the final manifold (Pearson r > 0.999 between GTM and iGTM
coordinates in practice).
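
The accumulation described above can be illustrated with a short NumPy sketch.
It is not the ugtm implementation: the responsibilities are random stand-ins,
and the names ``accumulate_stats``, ``R`` and ``block_size`` are assumptions
made for illustration. It only shows that summing the per-block statistics
reproduces the full-batch ones.

```python
import numpy as np


def accumulate_stats(X, R, block_size):
    """Accumulate g_vec and RT_acc block by block.

    X: (N, n_dimensions) data; R: (n_nodes, N) stand-in responsibilities.
    In a real EM pass each block of R would be computed on the fly from its
    block of X, so the full R would never exist. It is passed whole here
    only so the result can be compared with the full-batch statistics.
    """
    n_nodes, n_dims = R.shape[0], X.shape[1]
    g_vec = np.zeros(n_nodes)
    RT_acc = np.zeros((n_nodes, n_dims))
    for start in range(0, X.shape[0], block_size):
        stop = start + block_size
        R_blk = R[:, start:stop]        # (n_nodes, rows in this block)
        X_blk = X[start:stop]           # (rows in this block, n_dimensions)
        g_vec += R_blk.sum(axis=1)      # row sums of R for this block
        RT_acc += R_blk @ X_blk         # R @ X for this block
    return g_vec, RT_acc


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
R = rng.random((16, 1000))
R /= R.sum(axis=0, keepdims=True)       # responsibilities of each point sum to 1

g_vec, RT_acc = accumulate_stats(X, R, block_size=128)
```

Only ``g_vec``, ``RT_acc`` and one block of ``R`` are live inside the loop,
which is where the O(block_size × n_nodes) bound comes from.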


Run eIGTM
----------

The API mirrors :class:`~ugtm.ugtm_sklearn.eGTM` with one extra parameter,
``n_blocks``. The default ``n_blocks=0`` selects the number of blocks
automatically as ``ceil(N / 5000)``::

        from ugtm import eIGTM
        import numpy as np

        X_train = np.random.randn(10000, 50)
        X_test  = np.random.randn(1000, 50)

        # Fit iGTM on X_train; blocks chosen automatically
        model = eIGTM().fit(X_train)

        # 2D projection of X_test
        transformed = model.transform(X_test)

        # Or fit and transform in one call
        transformed_train = eIGTM().fit_transform(X_train)


The low-level wrapper ``runIGTM`` mirrors :func:`~ugtm.ugtm_gtm.runGTM`::

        from ugtm import runIGTM
        import numpy as np

        data = np.random.randn(10000, 50)
        model = runIGTM(data, n_blocks=5)

        # Access 2D coordinates
        coordinates = model.matMeans
        modes       = model.matModes


Block-wise projection for large test sets
------------------------------------------

When the test set is also large, use the
:meth:`~ugtm.ugtm_sklearn.eIGTM.transform_blocks` generator so that peak
memory stays bounded::

        from ugtm import eIGTM
        import numpy as np

        X_train = np.random.randn(10000, 50)
        X_test  = np.random.randn(10000, 50)

        model = eIGTM().fit(X_train)

        # Yields one (block_size, 2) array at a time
        for block_coords in model.transform_blocks(X_test, block_size=1000):
            # process or write block_coords here
            pass

        # To collect all at once (only if the result fits in RAM):
        coords = np.vstack(list(model.transform_blocks(X_test, block_size=1000)))

For ``model='responsibilities'``, ``transform_blocks`` yields
(block_size, n_nodes) arrays, avoiding the N×K matrix entirely::

        model = eIGTM(model='responsibilities').fit(X_train)
        for resp_block in model.transform_blocks(X_test, block_size=1000):
            # resp_block.shape == (1000, n_nodes)
            pass
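
When the collected result fits in RAM but the temporary list built for
``np.vstack`` is unwelcome, the blocks can be copied into a preallocated array
as they arrive. The sketch below uses a stand-in generator,
``fake_transform_blocks``, in place of ``model.transform_blocks`` so that it
runs without a fitted model. Both the stand-in and the helper
``collect_blocks`` are illustrative names, not part of ugtm.

```python
import numpy as np


def fake_transform_blocks(X, block_size):
    """Stand-in block generator: yields one (rows, 2) array per block."""
    for start in range(0, X.shape[0], block_size):
        # The final block is shorter when block_size does not divide the rows.
        yield X[start:start + block_size, :2].copy()


def collect_blocks(blocks, n_rows, n_cols):
    """Copy generator output into one preallocated array, block by block."""
    out = np.empty((n_rows, n_cols))
    filled = 0
    for block in blocks:
        out[filled:filled + block.shape[0]] = block
        filled += block.shape[0]
    if filled != n_rows:
        raise ValueError(f"expected {n_rows} rows, received {filled}")
    return out


X_test = np.random.default_rng(0).standard_normal((2500, 50))
coords = collect_blocks(
    fake_transform_blocks(X_test, block_size=1000),
    n_rows=X_test.shape[0],
    n_cols=2,
)
```

Only one block is held alongside the output array at any time, whereas
``np.vstack(list(...))`` keeps every block alive until the stacked copy has
been made.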


Choosing n_blocks
-----------------

With the default ``n_blocks=0``, blocks are sized to ~5 000 rows each.
For a dataset of N rows:

.. list-table::
   :header-rows: 1
   :widths: 20 20

   * - N
     - Auto n_blocks
   * - ≤ 5 000
     - 1 (standard GTM behaviour, up to the β⁻¹ approximation)
   * - 10 000
     - 2
   * - 50 000
     - 10
   * - 1 000 000
     - 200

Setting ``n_blocks=1`` reproduces standard GTM behaviour (with the minor
β⁻¹ approximation noted above).
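
The table follows directly from the ``ceil(N / 5000)`` rule, which the sketch
below reproduces together with a rough memory estimate. The helper names, the
grid size of 256 nodes, the float64 storage and the even split of rows across
blocks are all assumptions made for illustration. They are not taken from the
ugtm source.

```python
import math


def auto_n_blocks(n_rows, rows_per_block=5000):
    """Hypothetical helper applying the documented rule ceil(N / 5000)."""
    return max(1, math.ceil(n_rows / rows_per_block))


def block_bytes(block_rows, n_nodes, itemsize=8):
    """Bytes needed for one float64 (block_rows, n_nodes) responsibility block."""
    return block_rows * n_nodes * itemsize


N_NODES = 256  # assumed grid size (a 16 x 16 map), chosen only for illustration

for n_rows in (5_000, 10_000, 50_000, 1_000_000):
    n_blocks = auto_n_blocks(n_rows)
    # Assumes rows are split as evenly as possible across blocks.
    largest_block = math.ceil(n_rows / n_blocks)
    full_mb = block_bytes(n_rows, N_NODES) / 1e6
    block_mb = block_bytes(largest_block, N_NODES) / 1e6
    print(f"N={n_rows:>9,}  n_blocks={n_blocks:>3}  "
          f"full R: {full_mb:8.1f} MB  largest block: {block_mb:5.1f} MB")
```

Under these assumptions the per-block matrix stays near 10 MB for every row of
the table, while the full matrix grows to about 2 GB at one million rows.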


Visualize projection
--------------------

.. altair-plot::

        from ugtm import eIGTM
        import numpy as np
        import altair as alt
        import pandas as pd

        np.random.seed(0)
        X_train = np.random.randn(100, 10)
        X_test  = np.random.randn(50, 10)
        labels  = np.random.choice(['A', 'B', 'C'], size=50)

        transformed = eIGTM(n_blocks=2).fit(X_train).transform(X_test)

        df = pd.DataFrame(transformed, columns=["x1", "x2"])
        df["label"] = labels

        alt.Chart(df).mark_point(size=60).encode(
            x='x1', y='x2',
            color=alt.Color('label:N', scale=alt.Scale(scheme='set1')),
            tooltip=['x1', 'x2', 'label']
        ).properties(title="iGTM projection of X_test", width=300, height=300).interactive()


Parameter optimization
-----------------------

:class:`~ugtm.ugtm_sklearn.eIGTM` is sklearn-compatible and works with
``GridSearchCV``::

        from ugtm import eIGTM
        import numpy as np
        from sklearn.pipeline import Pipeline
        from sklearn.model_selection import GridSearchCV
        from sklearn.neighbors import KNeighborsClassifier

        X_train = np.random.randn(200, 20)
        y_train = np.random.choice([0, 1], size=200)

        pipe = Pipeline([
            ('igtm', eIGTM()),
            ('knn',  KNeighborsClassifier()),
        ])

        param_grid = {
            'igtm__k':     [4, 8],
            'igtm__s':     [0.3, 1.0],
            'igtm__regul': [0.01, 0.1],
        }

        gs = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
        gs.fit(X_train, y_train)
        print(gs.best_params_)