heat.cluster._kcluster

Base module for k-clustering algorithms.

Module Contents

class _KCluster(metric: Callable, n_clusters: int, init: str | heat.core.dndarray.DNDarray, max_iter: int, tol: float, random_state: int)[source]

Bases: heat.ClusteringMixin, heat.BaseEstimator

Base class for k-statistics clustering algorithms (kmeans, kmedians, kmedoids). The clusters are represented by centroids c_i (the term is borrowed from kmeans for simplicity).

Parameters:
  • metric (function) – One of the distance metrics in ht.spatial.distance. It must be passed as a lambda function that takes exactly two arrays as input.

  • n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.

  • init (str or DNDarray, default: ‘random’) –

    Method for initialization:

    • ‘probability_based’ : selects initial cluster centers for the clustering in a smart way to speed up convergence (k-means++)

    • ‘random’: choose k observations (rows) at random from data for the initial centroids.

    • ‘batchparallel’: use the batch-parallel algorithm to initialize the centroids; only available for split=0 and KMeans or KMedians

    • DNDarray: gives the initial centers; should be of Shape = (n_clusters, n_features)

  • max_iter (int) – Maximum number of iterations for a single run.

  • tol (float, default: 1e-4) – Relative tolerance with regard to inertia to declare convergence.

  • random_state (int) – Determines random number generation for centroid initialization.
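The metric and init parameters can be illustrated with a minimal pure-Python sketch. It does not use Heat: pairwise_minkowski, init_random and init_probability_based are hypothetical helpers written for this example, and plain lists of rows stand in for DNDarrays. The lambda binds the extra parameter so that the metric takes exactly two arrays, and the two functions mimic the ‘random’ and ‘probability_based’ (k-means++-style) options.

```python
import random


def pairwise_minkowski(xs, ys, p):
    """Distance matrix between the rows of xs and the rows of ys (hypothetical helper)."""
    return [
        [sum(abs(u - v) ** p for u, v in zip(x, y)) ** (1.0 / p) for y in ys]
        for x in xs
    ]


# Bind the extra parameter p so that the metric takes exactly two arrays.
metric = lambda xs, ys: pairwise_minkowski(xs, ys, p=2)


def init_random(data, k, rng):
    """'random': choose k distinct observations (rows) as initial centroids."""
    return rng.sample(data, k)


def init_probability_based(data, k, metric, rng):
    """k-means++-style seeding: each further centroid is drawn with probability
    proportional to its squared distance to the nearest centroid chosen so far."""
    centroids = [rng.choice(data)]
    while len(centroids) < k:
        distances = metric(data, centroids)
        weights = [min(row) ** 2 for row in distances]
        centroids.append(rng.choices(data, weights=weights, k=1)[0])
    return centroids


data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
rng = random.Random(0)  # plays the role of random_state
random_centers = init_random(data, 2, rng)
seeded_centers = init_probability_based(data, 2, metric, rng)
```

Rows that are already centroids get weight zero in the seeding step, so they are never drawn twice as long as the data contains at least k distinct rows.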

n_clusters
init
max_iter
tol
random_state
_metric
_cluster_centers = None
_functional_value = None
_labels = None
_inertia = None
_n_iter = None
_p = None
_initialize_cluster_centers(x: heat.core.dndarray.DNDarray, oversampling: float, iter_multiplier: float)[source]

Initializes the K-Means centroids.

Parameters:
  • x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)

  • oversampling (float) – oversampling factor used in the k-means|| initialization of centroids

  • iter_multiplier (float) – factor that increases the number of iterations used in the initialization of centroids

_centroid_sampling_helper(x: heat.core.dndarray.DNDarray, centroids: heat.core.dndarray.DNDarray, oversampling: float, num_iters: int)[source]

Helper function for the k-means|| initialization of centroids. Samples new centroids based on a probability distribution derived from the distance of data points to the current set of centroids.

Parameters:
  • x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)

  • centroids (DNDarray) – The initial set of centroids

  • oversampling (float) – oversampling factor used in the k-means|| initialization of centroids

  • num_iters (int) – number of iterations used in the initialization of centroids
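The sampling step described above can be sketched in pure Python as follows. This is an illustration of the general k-means|| idea, not Heat's implementation: sample_candidates and sq_dist are hypothetical names, and the inclusion probability min(1, oversampling * d2 / total) is an assumption taken from the usual description of k-means||.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def sample_candidates(data, centroids, oversampling, num_iters, rng):
    """In each round, keep every point independently with probability
    min(1, oversampling * d2 / total), where d2 is its squared distance
    to the nearest current centroid and total is the sum of all d2."""
    centroids = list(centroids)
    for _ in range(num_iters):
        d2 = [min(sq_dist(x, c) for c in centroids) for x in data]
        total = sum(d2)
        if total == 0.0:
            break  # every point already coincides with a centroid
        for x, d in zip(data, d2):
            if rng.random() < min(1.0, oversampling * d / total):
                centroids.append(x)
    return centroids


data = [[0.0], [1.0], [9.0], [10.0]]
candidates = sample_candidates(data, [[0.0]], oversampling=2.0, num_iters=3,
                               rng=random.Random(0))
```

The result usually holds more than n_clusters candidates; in k-means|| these are reduced to the requested number of centroids in a later step, which this sketch leaves out.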

_assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)[source]

Assigns the passed data points to the centroids based on the respective metric.

Parameters:
  • x (DNDarray) – Data points, Shape = (n_samples, n_features)

  • eval_functional_value (bool, default: False) – If True, the current K-Clustering functional value of the clustering algorithm is evaluated
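A minimal sketch of the assignment step, assuming the metric has already produced a distance matrix of shape (n_samples, n_clusters). The function name, the returned tuple and the exponent p applied to the minimum distances are assumptions of this example, not Heat's API.

```python
def assign_to_cluster(distances, p=2):
    """Return the index of the closest centroid for every sample, together
    with the sum of the minimum distances raised to the power p."""
    labels = [min(range(len(row)), key=row.__getitem__) for row in distances]
    functional_value = sum(min(row) ** p for row in distances)
    return labels, functional_value


# Rows are samples, columns are centroids.
distances = [[0.0, 5.0], [1.0, 4.0], [6.0, 2.0]]
labels, value = assign_to_cluster(distances)
# labels == [0, 0, 1], value == 0.0 + 1.0 + 4.0 == 5.0
```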

_update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)[source]

The update strategy is algorithm-specific (e.g. calculate the mean of the assigned points for kmeans, the median for kmedians, etc.).

Parameters:
  • x (DNDarray) – Input Data

  • matching_centroids (DNDarray) – Index array of assigned centroids
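The difference between the update strategies can be shown with a small pure-Python sketch. The helper update_centroids and its reduce_fn argument are hypothetical, and keeping the previous centroid for an empty cluster is an assumption of this example.

```python
from statistics import mean, median


def update_centroids(data, labels, old_centroids, reduce_fn):
    """Recompute each centroid by applying reduce_fn feature-wise to the
    points assigned to it."""
    new_centroids = []
    for k, old in enumerate(old_centroids):
        members = [x for x, lab in zip(data, labels) if lab == k]
        if not members:
            new_centroids.append(old)  # assumption: empty clusters keep their centroid
            continue
        new_centroids.append([reduce_fn(col) for col in zip(*members)])
    return new_centroids


data = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [9.0, 9.0]]
labels = [0, 0, 0, 1]
old = [[0.0, 0.0], [9.0, 9.0]]
kmeans_centers = update_centroids(data, labels, old, mean)      # [[4.0, 0.0], [9.0, 9.0]]
kmedians_centers = update_centroids(data, labels, old, median)  # [[2.0, 0.0], [9.0, 9.0]]
```

The outlier at 10.0 pulls the mean to 4.0 while the median stays at 2.0, which is the practical reason to prefer a median update on data with outliers.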

fit(x: heat.core.dndarray.DNDarray)[source]

Computes the centroids of the clustering algorithm to fit the data x. The full pipeline is algorithm-specific.

Parameters:

x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
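Since the pipeline is left to the subclasses, the following is only a sketch of what a kmeans-style fit loop might look like, written in pure Python. fit_kmeans is a hypothetical function, and testing convergence on the squared shift of the centroids between iterations is an assumption of this example.

```python
def fit_kmeans(data, centroids, max_iter, tol):
    """Alternate assignment and mean update until the centroids stop moving
    (squared shift <= tol) or max_iter iterations have run."""

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    labels, n_iter = [], 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for every sample.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
            for x in data
        ]
        # Update step: feature-wise mean of the members of each cluster.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)
        shift = sum(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels, n_iter


data = [[0.0], [1.0], [10.0], [11.0]]
centers, labels, n_iter = fit_kmeans(data, [[0.0], [10.0]], max_iter=10, tol=1e-4)
# centers == [[0.5], [10.5]], labels == [0, 0, 1, 1], n_iter == 2
```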

predict(x: heat.core.dndarray.DNDarray)[source]

Predict the closest cluster each sample in x belongs to.

In the vector quantization literature, cluster_centers_() is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:

x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
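The code-book view can be made concrete with a short pure-Python sketch. The predict function below is a hypothetical stand-in that uses squared Euclidean distance; it is not the Heat method.

```python
def predict(samples, code_book):
    """Return, for every sample, the index of the closest code in the code book."""

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [
        min(range(len(code_book)), key=lambda k: sq_dist(x, code_book[k]))
        for x in samples
    ]


code_book = [[0.5, 0.5], [10.0, 10.0]]  # fitted cluster centers
codes = predict([[1.0, 0.0], [9.0, 12.0]], code_book)  # [0, 1]
# Decoding replaces each index by its code, i.e. the quantized sample.
reconstructed = [code_book[c] for c in codes]
```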