Overview

A high-performance topological machine learning toolbox in Python

giotto-tda is a high-performance topological machine learning toolbox in Python built on top of scikit-learn and distributed under the GNU AGPLv3 license. It is part of the Giotto family of open-source projects.

Guiding principles

  • Seamless integration with scikit-learn
    Strictly adhere to the scikit-learn API and development guidelines, and inherit the strengths of that framework.
  • Code modularity
    Topological feature creation steps as transformers. Allow for the creation of a large number of topologically-powered machine learning pipelines.
  • Standardisation
    Implement the most successful techniques from the literature into a generic framework with a consistent API.
  • Innovation
    Improve on existing algorithms, and make new ones available in open source.
  • Performance
    For the most demanding computations, fall back to state-of-the-art C++ implementations, bound efficiently to Python. Code is vectorized and implements multi-core parallelism (with joblib).
  • Data structures
    Support for tabular data, time series, graphs, and images.

30s guide to giotto-tda

[Figure: giotto-tda workflow]

To install giotto-tda, see the installation instructions.

The functionalities of giotto-tda are provided in scikit-learn–style transformers. This allows you to generate topological features from your data in a familiar way. Here is an example with the VietorisRipsPersistence transformer:

from gtda.homology import VietorisRipsPersistence
VR = VietorisRipsPersistence()

which computes topological summaries, called persistence diagrams, from collections of point clouds or weighted graphs, as follows:

diagrams = VR.fit_transform(point_clouds)

A plotting API allows for quick visual inspection of the outputs of many of giotto-tda’s transformers. To visualize the i-th output sample, run

VR.plot(diagrams, sample=i)

You can create scalar or vector features from persistence diagrams using giotto-tda’s dedicated transformers. Here is an example with the PersistenceEntropy transformer:

from gtda.diagrams import PersistenceEntropy
PE = PersistenceEntropy()
features = PE.fit_transform(diagrams)

features is a two-dimensional numpy array. This is important for making this type of topological feature generation fit into a typical machine learning workflow from scikit-learn. In particular, topological feature creation steps can be fed to or used alongside models from scikit-learn, creating end-to-end pipelines which can be evaluated in cross-validation, optimised via grid-searches, etc.:

from sklearn.ensemble import RandomForestClassifier
from gtda.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(point_clouds, labels)
RFC = RandomForestClassifier()
model = make_pipeline(VR, PE, RFC)
model.fit(X_train, y_train)
model.score(X_valid, y_valid)
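
The snippets above assume that point_clouds and labels already exist. As a minimal sketch of one way to build them, assuming NumPy and a toy two-class problem (noisy circles versus uniform noise), the following could be used. The helper name make_toy_point_clouds and all sizes are illustrative and not part of giotto-tda.

```python
import numpy as np


def make_toy_point_clouds(n_per_class=10, n_points=50, noise=0.05, seed=0):
    """Return (point_clouds, labels) for a toy two-class problem.

    Class 0: points sampled near the unit circle (one prominent loop).
    Class 1: points sampled uniformly in the square [-1, 1]^2 (no prominent loop).
    """
    rng = np.random.default_rng(seed)
    clouds, labels = [], []
    for _ in range(n_per_class):
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
        circle = np.column_stack([np.cos(angles), np.sin(angles)])
        clouds.append(circle + noise * rng.standard_normal((n_points, 2)))
        labels.append(0)
        clouds.append(rng.uniform(-1.0, 1.0, size=(n_points, 2)))
        labels.append(1)
    # Shape (n_samples, n_points, n_dimensions): one point cloud per sample.
    return np.stack(clouds), np.array(labels)


point_clouds, labels = make_toy_point_clouds()
print(point_clouds.shape, labels.shape)
```

The 3D array holds one point cloud per sample, so it can be passed directly to train_test_split together with the label vector.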

giotto-tda also implements the Mapper algorithm as a highly customisable scikit-learn Pipeline, and provides simple plotting functions for visualizing output Mapper graphs and interacting with the pipeline parameters in real time:

from gtda.mapper import make_mapper_pipeline, plot_interactive_mapper_graph
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

pipe = make_mapper_pipeline(filter_func=PCA(), clusterer=DBSCAN())
plot_interactive_mapper_graph(pipe, data)

Resources

Tutorials and examples

We provide a number of tutorials and examples, which offer:

  • quick start guides to the API;

  • in-depth examples showcasing more of the library’s features;

  • intuitive explanations of topological techniques.

Use cases

A selection of use cases for giotto-tda is collected on this page. The related repositories can be found on GitHub.

What’s new


Major Features and Improvements

This is a major release which adds substantial new functionality and introduces several improvements.

Persistent homology of directed flag complexes via pyflagser

  • The pyflagser package (source, docs) is now an official dependency of giotto-tda.

  • The FlagserPersistence transformer has been added to gtda.homology (#339). It wraps pyflagser.flagser_weighted to allow for computations of persistence diagrams from directed or undirected weighted graphs. A new notebook demonstrates its use.

Edge collapsing and performance improvements for persistent homology

  • GUDHI C++ components have been updated to the state of GUDHI v3.3.0, yielding performance improvements in SparseRipsPersistence, EuclideanCechPersistence and CubicalPersistence (#468).

  • Bindings for GUDHI’s edge collapser have been created and can now be used as an optional preprocessing step via the optional keyword argument collapse_edges in VietorisRipsPersistence and in gtda.externals.ripser (#469 and #483). When collapse_edges=True, and the input data and/or number of required homology dimensions is sufficiently large, the resulting runtimes for Vietoris–Rips persistent homology are state of the art.

  • The performance of the Ripser bindings has otherwise been improved by avoiding unnecessary data copies, better managing the memory, and using more efficient matrix routines (#501 and #507).

New transformers and functionality in gtda.homology

  • The WeakAlphaPersistence transformer has been added to gtda.homology (#464). Like VietorisRipsPersistence, SparseRipsPersistence and EuclideanCechPersistence, it computes persistent homology from point clouds, but its runtime can scale much better with size in low dimensions.

  • VietorisRipsPersistence now accepts sparse input when metric="precomputed" (#424).

  • CubicalPersistence now accepts lists of 2D arrays (#503).

  • A reduced_homology parameter has been added to all persistent homology transformers. When True, one infinite bar in the H0 barcode is removed for the user automatically. Previously, it was not possible to keep these bars in the simplicial homology transformers. The default is always True, which implies a breaking change in the case of CubicalPersistence (#467).

Persistence diagrams

  • A ComplexPolynomial feature extraction transformer has been added (#479).

  • A NumberOfPoints feature extraction transformer has been added (#496).

  • An option to normalize the entropy in PersistenceEntropy according to a heuristic has been added, and a nan_fill_value parameter allows any NaN produced by the entropy calculation to be replaced with a fixed constant (#450).

  • The computations in HeatKernel, PersistenceImage and in the pairwise distances and amplitudes related to them have been changed to yield the continuum limit when n_bins tends to infinity; sigma is now measured in the same units as the filtration parameter and defaults to 0.1 (#454).
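
To illustrate the PersistenceEntropy items above, here is a minimal pure-Python sketch of the textbook definition of persistence entropy: the base-2 Shannon entropy of the normalized bar lifetimes. It is not giotto-tda's implementation. The function name is hypothetical, the normalization heuristic is not shown, and the fill value of -1 is taken from the default quoted later in these notes.

```python
import math


def persistence_entropy(pairs, nan_fill_value=-1.0):
    """Base-2 Shannon entropy of the lifetimes in one persistence subdiagram.

    `pairs` is a list of (birth, death) tuples for a single homology dimension.
    """
    lifetimes = [death - birth for birth, death in pairs if death > birth]
    total = sum(lifetimes)
    if total == 0:
        # A trivial subdiagram would give 0/0, i.e. NaN; return a constant instead.
        return nan_fill_value
    probabilities = [lifetime / total for lifetime in lifetimes]
    return -sum(p * math.log2(p) for p in probabilities)


two_equal_bars = persistence_entropy([(0.0, 1.0), (0.0, 1.0)])
trivial = persistence_entropy([(0.3, 0.3)])
print(two_equal_bars, trivial)
```

Two bars of equal length carry exactly one bit of entropy in base 2, while a subdiagram with no bar of positive lifetime falls back to the fill value.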

New curves subpackage

A new curves subpackage has been added to preprocess, and extract features from, collections of multi-channel curves such as those returned by BettiCurve, PersistenceLandscape and Silhouette (#480). It contains:

  • A StandardFeatures transformer that can extract features channel-wise in a generic way.

  • A Derivative transformer that computes channel-wise derivatives of any order by discrete differences (#492).
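
The idea of channel-wise derivatives by discrete differences can be sketched in a few lines of plain Python. This is an illustration only: the function names are hypothetical, unit spacing between sample points is assumed (so no division by a step size), and giotto-tda's Derivative transformer may differ in details.

```python
def discrete_derivative(curve, order=1):
    """Apply `order` rounds of first differences to one sampled curve."""
    values = list(curve)
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def channelwise_derivative(curves, order=1):
    """Differentiate every channel of a multi-channel curve independently."""
    return [discrete_derivative(channel, order) for channel in curves]


# Two channels (e.g. two homology dimensions), five sample points each.
curves = [[0, 1, 4, 9, 16], [3, 3, 2, 2, 1]]
first = channelwise_derivative(curves, order=1)
second = channelwise_derivative(curves, order=2)
print(first)
print(second)
```

Each round of differencing shortens every channel by one sample, so an order-k derivative of a curve with n samples has n - k samples in this sketch.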

New metaestimators subpackage

A new metaestimators subpackage has been added with a CollectionTransformer meta-estimator which converts any transformer instance into a fit-transformer acting on collections (#495).
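
The concept of a transformer that acts on collections can be sketched with toy classes. Both classes below are hypothetical stand-ins written for illustration; they are not giotto-tda's CollectionTransformer and make no claim about its constructor or internals.

```python
import copy


class MeanCenterer:
    """Toy transformer: subtracts the mean learned in `fit` from each value."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class ToyCollectionTransformer:
    """Hypothetical stand-in: fit-transforms every item of a collection independently."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit_transform(self, X, y=None):
        # Use a fresh copy per item so that no fitted state leaks between items.
        return [copy.deepcopy(self.transformer).fit_transform(x) for x in X]


collection = [[1.0, 2.0, 3.0], [10.0, 20.0]]
centered = ToyCollectionTransformer(MeanCenterer()).fit_transform(collection)
print(centered)
```

Each item is centered around its own mean, which is the behaviour one would expect from a per-item fit-transform as opposed to a single fit on the whole collection.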

Images

  • A DensityFiltration for collections of binary images has been added (#473).

  • Padder and Inverter have been extended to greyscale images (#489).

Time series

  • TakensEmbedding is now a new transformer acting on collections of time series (#460).

  • The former TakensEmbedding acting on a single time series has been renamed to SingleTakensEmbedding transformer, and the internal logic employed in its fit for computing optimal hyperparameters is now available via a takens_embedding_optimal_parameters convenience function (#460).

  • The _slice_windows method of SlidingWindow has been made public and renamed to slice_windows (#460).
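
For readers unfamiliar with the Takens time-delay embedding, the following plain-Python sketch shows the construction on a single series and on a small collection. The function is hypothetical; the parameter names time_delay, dimension and stride are chosen for illustration and should be checked against the giotto-tda API reference.

```python
def takens_embedding(series, time_delay=1, dimension=2, stride=1):
    """Time-delay embedding of one scalar time series.

    Each output vector is
    (x[i], x[i + time_delay], ..., x[i + (dimension - 1) * time_delay]),
    with consecutive vectors starting `stride` steps apart.
    """
    span = (dimension - 1) * time_delay
    return [
        [series[i + j * time_delay] for j in range(dimension)]
        for i in range(0, len(series) - span, stride)
    ]


single = takens_embedding([0, 1, 2, 3, 4, 5], time_delay=2, dimension=3)
print(single)

# Acting on a collection simply means embedding each series on its own.
collection = [[0, 1, 2, 3, 4, 5], [5, 4, 3, 2, 1, 0]]
embedded = [takens_embedding(s, time_delay=2, dimension=3) for s in collection]
print(embedded)
```

Each embedded series becomes a point cloud in a space of the chosen dimension, which is the kind of input the persistent homology transformers operate on.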

Graphs

  • GraphGeodesicDistance has been improved as follows (#422):

    • The new parameters directed, unweighted and method have been added.

    • The rules on the role of zero entries, infinity entries, and non-stored values have been made clearer.

    • Masked arrays are now supported.

  • A mode parameter has been added to KNeighborsGraph; as in scikit-learn, it can be set to either "distance" or "connectivity" (#478).

  • List input is now accepted by all transformers in gtda.graphs, and outputs are consistently either lists or 3D arrays (#478).

  • Sparse matrices returned by KNeighborsGraph and TransitionGraph now have int dtype (0-1 adjacency matrices), and are not necessarily symmetric (#478).
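
The remark that k-nearest-neighbour adjacency matrices need not be symmetric, and the difference between the "connectivity" and "distance" modes, can be seen in a small dense sketch. The function below is hypothetical and written in plain Python for illustration; it is not how KNeighborsGraph is implemented.

```python
def knn_adjacency(distances, n_neighbors=1, mode="connectivity"):
    """Dense k-nearest-neighbour graph from a square matrix of pairwise distances.

    Row i holds the edges from point i to its `n_neighbors` closest other points.
    """
    n = len(distances)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: distances[i][j])
        for j in others[:n_neighbors]:
            adjacency[i][j] = 1 if mode == "connectivity" else distances[i][j]
    return adjacency


# Three points on a line at positions 0, 1 and 3.
positions = [0.0, 1.0, 3.0]
distances = [[abs(p - q) for q in positions] for p in positions]
connectivity = knn_adjacency(distances, n_neighbors=1)
weighted = knn_adjacency(distances, n_neighbors=1, mode="distance")
print(connectivity)
print(weighted)
```

The point at position 3 has the point at position 1 as its nearest neighbour, but not the other way round, so the resulting 0-1 matrix is not symmetric.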

Mapper

  • Pullback cover set labels and partial cluster labels have been added to Mapper node hovertexts (#445).

  • The functionality of Nerve and make_mapper_pipeline has been greatly extended (#447 and #456):

    • Node and edge metadata are now accessible in output igraph.Graph objects by means of the VertexSeq and EdgeSeq attributes vs and es (respectively). Graph-level dictionaries are no longer used.

    • Available node metadata can be accessed by graph.vs[attr_name], where attr_name is one of "pullback_set_label", "partial_cluster_label", or "node_elements".

    • Sizes of intersections are automatically stored as edge weights, accessible by graph.es["weight"].

    • A "store_intersections" keyword argument has been added to Nerve and make_mapper_pipeline to allow storing the indices defining node intersections as edge attributes, accessible via graph.es["edge_elements"].

    • A contract_nodes optional parameter has been added to both Nerve and make_mapper_pipeline; nodes which are subsets of other nodes are thrown away from the graph when this parameter is set to True.

    • A graph_ attribute is stored during Nerve.fit.

  • Two of the Nerve parameters (min_intersection and the new contract_nodes) are now available in the widgets generated by plot_interactive_mapper_graph, and the layout of these widgets has been improved (#456).

  • ParallelClustering and Nerve have been exposed in the documentation and in gtda.mapper’s __init__ (#447).
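
The roles of intersection sizes as edge weights, of min_intersection, and of contract_nodes can be sketched without igraph. The function toy_nerve below is hypothetical and works on plain Python sets. In particular, treating "subset" as "proper subset" when contracting nodes is an assumption of this sketch, not a statement about Nerve.

```python
from itertools import combinations


def toy_nerve(node_elements, min_intersection=1, contract_nodes=False):
    """Build nerve edges from the data-point indices contained in each node.

    Returns (kept_nodes, edges), where `edges` maps a pair of node ids to the
    size of the intersection of the two nodes (the edge weight).
    """
    nodes = {i: set(elements) for i, elements in enumerate(node_elements)}
    if contract_nodes:
        # Discard every node that is a proper subset of another node.
        nodes = {
            i: elements
            for i, elements in nodes.items()
            if not any(elements < other for other in nodes.values())
        }
    edges = {}
    for (i, a), (j, b) in combinations(nodes.items(), 2):
        weight = len(a & b)
        if weight >= min_intersection:
            edges[(i, j)] = weight
    return sorted(nodes), edges


node_elements = [[0, 1, 2], [2, 3], [1, 2], [4]]
default = toy_nerve(node_elements)
strict = toy_nerve(node_elements, min_intersection=2)
contracted = toy_nerve(node_elements, contract_nodes=True)
print(default)
print(strict)
print(contracted)
```

Raising min_intersection prunes weak edges, while contracting removes node 2, whose elements are all contained in node 0.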

Plotting

  • A plot_params kwarg is available in plotting functions and methods throughout to allow user customisability of output figures. The user must pass a dictionary with keys "layout" and/or "trace" (or "traces" in some cases) (#441).

  • Several plots produced by plot class methods now have default titles (#453).

  • Infinite deaths are now plotted by plot_diagrams (#461).

  • Possible multiplicities of persistence pairs in persistence diagram plots are now indicated in the hovertext (#454).

  • plot_heatmap now accepts boolean array input (#444).

New tutorials and examples

The following new tutorials have been added:

  • Topology of time series, which explains the theory of the Takens time-delay embedding and its use with persistent homology, demonstrates the new API of several components in gtda.time_series, and shows how to construct time series classification pipelines in giotto-tda by partially reproducing arXiv:1910.08245.

  • Topology in time series forecasting, which explains how to set up time series forecasting pipelines in giotto-tda via TransformerResamplerMixins and the giotto-tda Pipeline class.

  • Topological feature extraction from graphs, which explains what the features extracted from directed or undirected graphs by VietorisRipsPersistence, SparseRipsPersistence and FlagserPersistence are.

  • Classifying handwritten digits, which presents a fully-fledged machine learning pipeline in which cubical persistent homology is applied to the classification of handwritten images from the MNIST dataset, partially reproducing arXiv:1910.08345.

Utils

  • A check_collection input validation function has been added (#491).

  • validate_params now accepts "in" and "of" keys simultaneously in the references dictionaries, with "in" used for non-list-like types and "of" otherwise (#502).

Installation improvements

  • pybind11 is now treated as a standard git submodule in the developer installation (#459).

  • pandas is now part of the testing requirements when installing from source (#508).

Bug Fixes

  • A bug has been fixed which could lead to features with negative lifetime in persistent homology transformers when infinity_values was set too low (#339).

  • By relying on scipy’s shortest_path instead of scikit-learn’s graph_shortest_path, some errors in computing GraphGeodesicDistance (e.g. when some edges are zero) have been fixed (#422).

  • A bug in the handling of COO matrices by the ripser interface has been fixed (#465).

  • A bug which led to the incorrect handling of the homology_dimensions parameter in Filtering has been fixed (#439).

  • An issue with the use of joblib.Parallel, which led to errors when attempting to run HeatKernel, PersistenceImage, and the corresponding amplitudes and distances on large datasets, has been fixed (#428 and #481).

  • A bug leading to plots of persistence diagrams not showing points with negative births or deaths has been fixed, as has a bug with the computation of the range to be shown in the plot (#437).

  • A bug in the handling of persistence pairs with negative death values by Filtering has been fixed (#436).

  • A bug in the handling of homology_dimension_ix (now renamed to homology_dimension_idx) in the plot methods of HeatKernel and PersistenceImage has been fixed (#452).

  • A bug in the labelling of axes in HeatKernel and PersistenceImage plots has been fixed (#453 and #454).

  • PersistenceLandscape plots now show all homology dimensions, instead of just the first (#454).

  • A bug in the computation of amplitudes and pairwise distances based on persistence images has been fixed (#454).

  • Silhouette now does not create NaNs when a subdiagram is trivial (#454).

  • CubicalPersistence now does not create pairs with negative persistence when infinity_values is set too low (#467).

  • Warnings are no longer thrown by KNeighborsGraph when metric="precomputed" (#506).

  • A bug in Labeller.resample, affecting cases in which n_steps_future >= size - 1, has been fixed (#460).

  • A bug in validate_params, affecting the case of tuples of allowed types, has been fixed (#502).

Backwards-Incompatible Changes

  • The minimum required versions of most of the dependencies have been bumped. The updated dependencies are numpy >= 1.19.1, scipy >= 1.5.0, joblib >= 0.16.0, scikit-learn >= 0.23.1, python-igraph >= 0.8.2, plotly >= 4.8.2, and pyflagser >= 0.4.1 (#457).

  • GraphGeodesicDistance now returns either lists or 3D dense ndarrays for compatibility with the homology transformers (#422).

  • The output of PairwiseDistance has been transposed to match scikit-learn convention (n_samples_transform, n_samples_fit) (#420).

  • plot class methods now return figures instead of showing them (#441).

  • Mapper node and edge attributes are no longer stored as graph-level dictionaries, "node_id" is no longer an available node attribute, and the attributes nodes_ and edges_ previously stored by Nerve.fit have been removed in favour of a graph_ attribute (#447).

  • The homology_dimension_ix parameter available in some transformers in gtda.diagrams has been renamed to homology_dimension_idx (#452).

  • The base of the logarithm used by PersistenceEntropy is now 2 instead of e, and NaN values are replaced with -1 instead of 0 by default (#450 and #474).

  • The outputs of PersistenceImage, HeatKernel and of the pairwise distances and amplitudes based on them are now different due to the improvements described above.

  • Weights are no longer stored in the effective_metric_params_ attribute of PairwiseDistance, Amplitude and Scaler objects when the metric is persistence-image–based; only the weight function is (#454).

  • The homology_dimensions_ attributes of several transformers have been converted from lists to tuples. When possible, homology dimensions stored as parts of attributes are now presented as ints (#454).

  • gaussian_filter (used to make heat– and persistence-image–based representations/pairwise distances/amplitudes) is now called with mode="constant" instead of "reflect" (#454).

  • The default value of order in Amplitude has been changed from 2. to None, giving vector instead of scalar features (#454).

  • The meaning of the default None for weight_function in PersistenceImage (and in Amplitude and PairwiseDistance when metric="persistence_image") has been changed from the identity function to the function returning a vector of ones (#454).

  • Due to the updates in the GUDHI components, some of the bindings and Python interfaces to the GUDHI C++ components in gtda.externals have changed (#468).

  • Labeller.transform now returns a 1D array instead of a column array (#475).

  • PersistenceLandscape now returns 3D arrays instead of 4D ones, for compatibility with the new curves subpackage (#480).

  • By default, CubicalPersistence now removes one infinite bar in H0 (#467, and see above).

  • The former width parameter in SlidingWindow and Labeller has been replaced with a more intuitive size parameter. The relation between the two is: size = width + 1 (#460).

  • clusterer is now a required parameter in ParallelClustering (#508).

  • The max_fraction parameter in FirstSimpleGap and FirstHistogramGap now indicates the floor of max_fraction * n_samples; its default value has been changed from None to 1 (#412).
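
The relation size = width + 1 noted above for SlidingWindow and Labeller can be made concrete with a small sketch. The function below is hypothetical and aligns windows to the start of the series, which is an assumption made for simplicity; giotto-tda's own alignment convention may differ.

```python
def sliding_windows(series, size, stride=1):
    """Split a series into windows of `size` consecutive points."""
    return [
        series[start:start + size]
        for start in range(0, len(series) - size + 1, stride)
    ]


series = [10, 11, 12, 13, 14, 15]
old_width = 2
size = old_width + 1  # relation stated in the release notes
windows = sliding_windows(series, size=size, stride=2)
print(windows)
```

A former width of 2 therefore corresponds to windows that contain 3 points each, which is what the size parameter now states directly.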

Thanks to our Contributors

This release contains contributions from many people:

Umberto Lupo, Guillaume Tauzin, Julian Burella Pérez, Wojciech Reise, Lewis Tunstall, Nick Sale, and Anibal Medina-Mardones.

We are also grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions.