FirstHistogramGap

class gtda.mapper.FirstHistogramGap(linkage='single', affinity='euclidean', freq_threshold=0, max_fraction=1.0, n_bins_start=5, memory=None)[source]

Agglomerative clustering with stopping rule given by a histogram-based version of the first gap method, introduced in 1.

Given a frequency threshold f and an initial integer k: 1) create a histogram of k equally spaced bins of the number of merges in the dendrogram, as a function of the linkage parameter; 2) the value of linkage at which the tree is to be cut is the first one after which a bin of height no greater than f (i.e. a “gap”) is observed; 3) if no gap is observed, increase k and repeat 1) and 2) until termination. The algorithm can be partially overridden to ensure that the final number of clusters does not exceed a certain threshold, by passing a parameter max_fraction.

Parameters
  • linkage ('ward' | 'complete' | 'average' | 'single', optional, default: 'single') –

    Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.

    • 'ward' minimizes the variance of the clusters being merged.

    • 'average' uses the average of the distances of each observation of the two sets.

    • 'complete' linkage uses the maximum distances between all observations of the two sets.

    • 'single' uses the minimum of the distances between all observations of the two sets.

  • affinity (str, optional, default: 'euclidean') – Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted. If 'precomputed', a distance matrix (instead of a similarity matrix) is needed as input for fit.

  • freq_threshold (int, optional, default: 0) – The frequency threshold for declaring that a gap in the histogram of merges is present.

  • max_fraction (float, optional, default: 1.) – When not None, the algorithm is constrained to produce no more than max_fraction * n_samples clusters, even if a candidate gap is observed in the iterative process which would produce a greater number of clusters.

  • n_bins_start (int, optional, default: 5) – The initial number of bins in the iterative process for finding a gap in the histogram of merges.

  • memory (None, str or object with the joblib.Memory interface, optional, default: None) – Used to cache the output of the computation of the tree. By default, no caching is performed. If a string is given, it is the path to the caching directory.

n_clusters_

The number of clusters found by the algorithm.

Type

int

labels_

Cluster labels for each sample.

Type

ndarray of shape (n_samples,)

children_

The children of each non-leaf node. Values less than n_samples correspond to leaves of the tree which are the original samples. A node i greater than or equal to n_samples is a non-leaf node and has children children_[i - n_samples]. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node n_samples + i.

Type

ndarray of shape (n_nodes - 1, 2)

n_leaves_

Number of leaves in the hierarchical tree.

Type

int

distances_

Distances between nodes in the corresponding place in children_.

Type

ndarray of shape (n_nodes - 1,)

See also

FirstSimpleGap

References

1

G. Singh, F. Mémoli, and G. Carlsson, “Topological methods for the analysis of high dimensional data sets and 3D object recognition”; in SPBG, pp. 91–100, 2007.

__init__(linkage='single', affinity='euclidean', freq_threshold=0, max_fraction=1.0, n_bins_start=5, memory=None)[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None)[source]

Fit the agglomerative clustering from features or distance matrix.

The stopping rule is used to determine n_clusters_, and the full dendrogram is cut there to compute labels_.

Parameters
  • X (ndarray of shape (n_samples, n_features) or (n_samples, n_samples)) – Training instances to cluster, or distances between instances if affinity='precomputed'.

  • y (ignored) – Not used, present here for API consistency by convention.

Returns

Return type

self

fit_predict(X, y=None)

Perform clustering on X and returns cluster labels.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

labels – Cluster labels.

Return type

ndarray of shape (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object