heat.cluster.spectral

Module for Spectral Clustering, a graph-based machine learning algorithm

Module Contents

class Spectral(n_clusters: int = None, gamma: float = 1.0, metric: str = 'rbf', laplacian: str = 'fully_connected', threshold: float = 1.0, boundary: str = 'upper', eigen_solver: str = 'randomized', reigh_rank: int = 100, reigh_n_oversamples: int = 10, reigh_power_iter: int = 0, lanczos_n_iter: int = 300, assign_labels: str = 'kmeans', **params)[source]

Bases: heat.ClusteringMixin, heat.BaseEstimator

Spectral clustering

Variables:
  • n_clusters (int) – Number of clusters to fit

  • gamma (float) – Kernel coefficient sigma for ‘rbf’, ignored for metric=’euclidean’

  • metric (string) –

    How to construct the similarity matrix.

    • ’rbf’ : construct the similarity matrix using a radial basis function (RBF) kernel.

    • ’euclidean’ : construct the similarity matrix as only euclidean distance.

  • laplacian (str) – How to calculate the graph laplacian (affinity) Currently supported : ‘fully_connected’, ‘eNeighbour’

  • threshold (float) – Threshold for affinity matrix if laplacian=’eNeighbour’ Ignorded for laplacian=’fully_connected’

  • boundary (str) – How to interpret threshold: ‘upper’, ‘lower’ Ignorded for laplacian=’fully_connected’

  • eigen_solver (str) – The eigenvalue decomposition strategy to use. - ‘lanczos’ : Use Lanczos iterations to reduce the Laplacian matrix size before applying the torch eigenvalue solver. - ‘randomized’ : Use a randomized algorithm to compute the approximate eigenvalues and eigenvectors.

  • reigh_rank (int) – number of samples for randomized eigenvalue decomposition. Only used if eigen_solver=’randomized’. It must hold reigh_rank >= n_clusters. If n_clusters is None (automatic selection of number of clusters), reigh_rank gives an upper bound on the number of clusters that can be found. Therefore, reigh_rank should be set high enough to capture the expected number of clusters in that case.

  • reigh_n_oversamples (int) – number of oversamples for randomized eigenvalue decomposition. Only used if eigen_solver=’randomized’. Default is 10.

  • reigh_power_iter (int) – number of power iterations for randomized eigenvalue decomposition. Only used if eigen_solver=’randomized’. Default is 0. Consider increasing this value if the eigen-spectrum of the Laplacian decays slowly.

  • lanczos_n_iter (int) – number of Lanczos iterations for Eigenvalue decomposition. Only used if eigen_solver=’lanczos’. Default is 300.

  • assign_labels (str) – The strategy to use to assign labels in the embedding space.

  • **params (dict) – Parameter dictionary for the assign_labels estimator

n_clusters = None
gamma = 1.0
metric = 'rbf'
laplacian = 'fully_connected'
threshold = 1.0
boundary = 'upper'
lanczos_n_iter = 300
assign_labels = 'kmeans'
eigen_solver = 'randomized'
reigh_n_oversamples = 10
reigh_power_iter = 0
reigh_rank = 100
_labels = None
_spectral_embedding(x: heat.core.dndarray.DNDarray) Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray][source]

Helper function for dataset x embedding. Returns Tupel(Eigenvalues, Eigenvectors) of the graph’s Laplacian matrix.

Parameters:

x (DNDarray) – Sample Matrix for which the embedding should be calculated

Notes

This will throw out the complex side of the eigenvalues found during this.

fit(x: heat.core.dndarray.DNDarray)[source]

Clusters dataset X via spectral embedding. Computes the low-dim representation by calculation of eigenspectrum (eigenvalues and eigenvectors) of the graph laplacian from the similarity matrix and fits the eigenvectors that correspond to the k lowest eigenvalues with a seperate clustering algorithm (currently only kmeans is supported). Similarity metrics for adjacency calculations are supported via spatial.distance. The eigenvalues and eigenvectors are computed by reducing the Laplacian via lanczos iterations and using the torch eigenvalue solver on this smaller matrix. If other eigenvalue decompostion methods are supported, this will be expanded.

Parameters:

x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)

predict(x: heat.core.dndarray.DNDarray) heat.core.dndarray.DNDarray[source]

Return the label each sample in X belongs to. X is transformed to the low-dim representation by calculation of eigenspectrum (eigenvalues and eigenvectors) of the graph laplacian from the similarity matrix. Inference of lables is done by extraction of the closest centroid of the n_clusters eigenvectors from the previously fitted clustering algorithm (kmeans).

Parameters:

x (DNDarray) – New data to predict. Shape = (n_samples, n_features)

Warning

Caution: Calculation of the low-dim representation requires some time!