class gtda.mapper.FirstHistogramGap(linkage='single', affinity='euclidean', freq_threshold=0, max_fraction=1.0, n_bins_start=5, memory=None)[source]

Agglomerative clustering with stopping rule given by a histogram-based version of the first gap method, introduced in 1.

Given a frequency threshold f and an initial integer k: 1) create a histogram of k equally spaced bins of the number of merges in the dendrogram, as a function of the linkage parameter; 2) the value of linkage at which the tree is to be cut is the first one after which a bin of height no greater than f (i.e. a “gap”) is observed; 3) if no gap is observed, increase k and repeat 1) and 2) until termination. The algorithm can be partially overridden to ensure that the final number of clusters does not exceed a certain threshold, by passing a parameter max_fraction.

  • linkage ('ward' | 'complete' | 'average' | 'single', optional, default: 'single') –

    Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.

    • 'ward' minimizes the variance of the clusters being merged.

    • 'average' uses the average of the distances of each observation of the two sets.

    • 'complete' linkage uses the maximum distances between all observations of the two sets.

    • 'single' uses the minimum of the distances between all observations of the two sets.

  • affinity (str, optional, default: 'euclidean') – Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted. If 'precomputed', a distance matrix (instead of a similarity matrix) is needed as input for fit.

  • freq_threshold (int, optional, default: 0) – The frequency threshold for declaring that a gap in the histogram of merges is present.

  • max_fraction (float, optional, default: 1.) – When not None, the algorithm is constrained to produce no more than max_fraction * n_samples clusters, even if a candidate gap is observed in the iterative process which would produce a greater number of clusters.

  • n_bins_start (int, optional, default: 5) – The initial number of bins in the iterative process for finding a gap in the histogram of merges.

  • memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the output of the computation of the tree. By default, no caching is performed. If a string is given, it is the path to the caching directory.


The number of clusters found by the algorithm.




Cluster labels for each sample.


ndarray of shape (n_samples,)


The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i.


ndarray of shape (n_nodes - 1, 2)


Number of leaves in the hierarchical tree.




Distances between nodes in the corresponding place in children_.


ndarray of shape (n_nodes - 1,)

See also




G. Singh, F. Mémoli, and G. Carlsson, “Topological methods for the analysis of high dimensional data sets and 3D object recognition”; in SPBG, pp. 91–100, 2007.

__init__(linkage='single', affinity='euclidean', freq_threshold=0, max_fraction=1.0, n_bins_start=5, memory=None)[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None)[source]

Fit the agglomerative clustering from features or distance matrix.

The stopping rule is used to determine n_clusters_, and the full dendrogram is cut there to compute labels_.

  • X (ndarray of shape (n_samples, n_features) or (n_samples, n_samples)) – Training instances to cluster, or distances between instances if affinity='precomputed'.

  • y (ignored) – Not used, present here for API consistency by convention.


Return type


fit_predict(X, y=None)

Perform clustering on X and returns cluster labels.

  • X (array-like of shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.


labels – Cluster labels.

Return type

ndarray of shape (n_samples,)


Get parameters for this estimator.


deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.


params – Parameter names mapped to their values.

Return type

mapping of string to any


Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.


**params (dict) – Estimator parameters.


self – Estimator instance.

Return type