heat.cluster._kcluster

Base module for k-clustering algorithms.

Module Contents

class _KCluster(metric: Callable, n_clusters: int, init: str | heat.core.dndarray.DNDarray, max_iter: int, tol: float, random_state: int)

Bases: heat.ClusteringMixin, heat.BaseEstimator

Base class for k-statistics clustering algorithms (kmeans, kmedians, kmedoids). The clusters are represented by centroids c_i (the kmeans term is used for all algorithms for simplicity).

Parameters:
  • metric (function) – One of the distance metrics in ht.spatial.distance. Must be passed as a lambda function that takes exactly two arrays as input.

  • n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.

  • init (str or DNDarray, default: ‘random’) –

    Method for initialization:

    • ‘probability_based’: selects the initial cluster centers with probability-based seeding (k-means++) to speed up convergence.

    • ‘random’: choose k observations (rows) at random from data for the initial centroids.

    • ‘batchparallel’: use the batch parallel algorithm to initialize the centroids. Only available for split=0 and KMeans or KMedians.

    • DNDarray: the initial centers, given explicitly. Shape = (n_clusters, n_features)

  • max_iter (int) – Maximum number of iterations for a single run.

  • tol (float, default: 1e-4) – Relative tolerance with regard to inertia to declare convergence.

  • random_state (int) – Determines random number generation for centroid initialization.
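
The ‘random’ and ‘probability_based’ initialization options can be illustrated with a minimal pure-Python sketch. It is not Heat's implementation: the helper names (squared_distance, init_random, init_probability_based) are hypothetical, the data are plain lists rather than DNDarrays, and the seeding rule shown is the textbook k-means++ scheme, assumed here to match what ‘probability_based’ refers to.

```python
import random


def squared_distance(a, b):
    # Squared Euclidean distance between two points given as lists.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def init_random(data, k, rng):
    # 'random': choose k distinct observations (rows) as initial centroids.
    return [list(row) for row in rng.sample(data, k)]


def init_probability_based(data, k, rng):
    # Textbook k-means++ seeding (assumption): the first centroid is drawn
    # uniformly; each further centroid is drawn with probability proportional
    # to its squared distance to the nearest centroid chosen so far.
    centroids = [list(rng.choice(data))]
    while len(centroids) < k:
        weights = [min(squared_distance(row, c) for c in centroids) for row in data]
        centroids.append(list(rng.choices(data, weights=weights, k=1)[0]))
    return centroids


data = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]]
rng = random.Random(42)  # plays the role of random_state
random_centroids = init_random(data, 3, rng)
seeded_centroids = init_probability_based(data, 3, rng)
```

Because already chosen rows get weight zero, the seeded variant never picks the same observation twice as long as the rows are distinct.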

_initialize_cluster_centers(x: heat.core.dndarray.DNDarray)

Initializes the cluster centroids.

Parameters:

x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)

_assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)

Assigns the passed data points to their closest centroids according to the configured metric.

Parameters:
  • x (DNDarray) – Data points, Shape = (n_samples, n_features)

  • eval_functional_value (bool, default: False) – If True, the current value of the K-Clustering functional is evaluated.
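
As a rough illustration of the assignment step, the sketch below labels each point with the index of its nearest centroid, using a metric wrapped in a two-argument lambda as the metric parameter requires. All names (minkowski, assign_to_cluster) are hypothetical and operate on plain lists. The functional value is assumed to be the sum of distances to the assigned centroids; the definition Heat actually uses may differ.

```python
def minkowski(a, b, p):
    # Generic Minkowski distance; takes an extra parameter p.
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


# Wrap the metric so that it accepts exactly two arrays.
metric = lambda a, b: minkowski(a, b, p=2)


def assign_to_cluster(data, centroids, metric, eval_functional_value=False):
    labels = []
    functional_value = 0.0
    for row in data:
        distances = [metric(row, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        labels.append(nearest)
        # Assumption: the functional is the sum of point-to-centroid distances.
        functional_value += distances[nearest]
    if eval_functional_value:
        return labels, functional_value
    return labels


data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.5], [10.0, 10.5]]
labels, value = assign_to_cluster(data, centroids, metric, eval_functional_value=True)
```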

_update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)

The update strategy is algorithm-specific (e.g. the mean of the assigned points for kmeans, the median for kmedians).

Parameters:
  • x (DNDarray) – Input data

  • matching_centroids (DNDarray) – Index array of assigned centroids
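
The difference between the mean-based and median-based update can be sketched in pure Python. The function update_centroids and its reduce argument are hypothetical, and keeping the old centroid for an empty cluster is an assumption of this sketch, not documented Heat behavior.

```python
from statistics import mean, median


def update_centroids(data, labels, centroids, reduce):
    # labels[i] is the index of the centroid assigned to data[i].
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [row for row, label in zip(data, labels) if label == i]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(old))
            continue
        # Reduce each feature column over the cluster members.
        new_centroids.append([reduce(col) for col in zip(*members)])
    return new_centroids


data = [[0.0, 0.0], [1.0, 0.0], [8.0, 0.0], [10.0, 10.0]]
labels = [0, 0, 0, 1]
centroids = [[0.0, 0.0], [9.0, 9.0]]
mean_centroids = update_centroids(data, labels, centroids, mean)      # kmeans-style
median_centroids = update_centroids(data, labels, centroids, median)  # kmedians-style
```

With the outlying point at 8.0, the mean-based centroid of cluster 0 moves to 3.0 along the first feature while the median-based one stays at 1.0, which shows why the median is the more robust choice.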

fit(x: heat.core.dndarray.DNDarray)

Computes the centroids of the clustering algorithm to fit the data x. The full pipeline is algorithm-specific.

Parameters:

x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
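
One possible shape of such a pipeline, for the kmeans case, is sketched below: alternate assignment and mean update until max_iter is reached or the inertia stops changing by more than tol in relative terms. The function fit_sketch is hypothetical, works on plain lists, and its convergence test is only one reading of "relative tolerance with regard to inertia"; the check performed by the concrete Heat estimators may differ.

```python
def fit_sketch(data, centroids, max_iter=300, tol=1e-4):
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    inertia_old = None
    labels = []
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for every row.
        labels = [
            min(range(len(centroids)), key=lambda i: sqdist(row, centroids[i]))
            for row in data
        ]
        # Update step (kmeans): mean of the assigned rows per cluster.
        updated = []
        for i, old in enumerate(centroids):
            members = [row for row, label in zip(data, labels) if label == i]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(list(old))  # assumption: keep empty clusters as-is
        centroids = updated
        # Inertia: sum of squared distances to the assigned centroids.
        inertia = sum(sqdist(row, centroids[l]) for row, l in zip(data, labels))
        if inertia_old is not None and abs(inertia_old - inertia) <= tol * max(inertia_old, 1e-12):
            break
        inertia_old = inertia
    return centroids, labels, n_iter


data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers, labels, n_iter = fit_sketch(data, [[0.0, 0.0], [10.0, 0.0]])
```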

predict(x: heat.core.dndarray.DNDarray)

Predicts the closest cluster for each sample in x.

In the vector quantization literature, cluster_centers_() is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:

x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
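
The code book interpretation can be made concrete with a small pure-Python sketch. The names predict_sketch, manhattan, and code_book are hypothetical, and plain lists stand in for DNDarrays: each predicted value is an index into the code book, and looking it up yields the quantized version of the sample.

```python
def predict_sketch(samples, code_book, metric):
    # Return, for every sample, the index of the closest code (centroid).
    return [
        min(range(len(code_book)), key=lambda i: metric(s, code_book[i]))
        for s in samples
    ]


# Two-argument metric, as required by the metric parameter.
manhattan = lambda a, b: sum(abs(ai - bi) for ai, bi in zip(a, b))

code_book = [[0.0, 1.0], [10.0, 1.0]]  # fitted cluster centers
new_samples = [[1.0, 1.0], [9.0, 3.0], [4.0, 0.0]]

codes = predict_sketch(new_samples, code_book, manhattan)
# Vector quantization: replace each sample by its closest code.
quantized = [code_book[c] for c in codes]
```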