heat.cluster.kmeans

Module implementing the k-means algorithm

Module Contents

class KMeans(n_clusters: int = 8, init: str | heat.core.dndarray.DNDarray = 'random', max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None)

Bases: heat.cluster._kcluster._KCluster

K-Means clustering algorithm. An implementation of Lloyd’s algorithm [1].

Variables:
  • n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.

  • init (str or DNDarray) –

    Method for initialization:

    • ‘k-means++’: selects initial cluster centers for the clustering in a smart way to speed up convergence [2].

    • ‘random’: chooses k observations (rows) at random from the data as the initial centroids.

    • ‘batchparallel’: initializes the centroids using the batch-parallel algorithm (see BatchParallelKMeans for more information).

    • DNDarray: an array of shape (n_clusters, n_features) giving the initial centers.

  • max_iter (int) – Maximum number of iterations of the k-means algorithm for a single run.

  • tol (float) – Relative tolerance with regard to inertia to declare convergence.

  • random_state (int) – Determines random number generation for centroid initialization.
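
Examples

A minimal usage sketch; the data below is randomly generated purely for illustration, and the shapes and hyperparameters are arbitrary:

    import heat as ht

    # 1000 samples with 2 features, distributed along the sample axis
    # across all available processes
    x = ht.random.randn(1000, 2, split=0)

    # cluster into 4 groups with randomly chosen initial centroids
    kmeans = ht.cluster.KMeans(n_clusters=4, init="random", max_iter=300)
    kmeans.fit(x)

    print(kmeans.cluster_centers_)  # shape (4, 2)
    labels = kmeans.predict(x)      # index of the closest centroid per sample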

Notes

The average complexity is \(O(k \cdot n \cdot T)\), where \(k\) is the number of clusters, \(n\) the number of samples, and \(T\) the number of iterations. In practice, the k-means algorithm is very fast, but it may converge to a local minimum. That is why it can be useful to restart it several times (see the sketch below). If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e., cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.
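
One hypothetical restart strategy is sketched below. The inertia is recomputed manually via ht.spatial.cdist rather than read from any internal attribute, and fit_with_restarts is an illustrative helper, not part of this class:

    import heat as ht

    def fit_with_restarts(x, k, n_restarts=5, seed=0):
        # keep the run with the lowest within-cluster sum of squares
        best_model, best_inertia = None, float("inf")
        for r in range(n_restarts):
            model = ht.cluster.KMeans(n_clusters=k, init="random",
                                      random_state=seed + r).fit(x)
            # squared distance of every sample to its nearest centroid
            d = ht.spatial.cdist(x, model.cluster_centers_)
            inertia = ht.sum(ht.min(d, axis=1) ** 2).item()
            if inertia < best_inertia:
                best_model, best_inertia = model, inertia
        return best_model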

References

[1] Lloyd, Stuart P., “Least squares quantization in PCM”, IEEE Transactions on Information Theory, 28 (2), pp. 129–137, 1982.

[2] Arthur, D., Vassilvitskii, S., “k-means++: The Advantages of Careful Seeding”, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics Philadelphia, PA, USA. pp. 1027–1035, 2007.

_update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)

Compute the coordinates of the new centroids as the means of the data points in x assigned to each centroid.

Parameters:
  • x (DNDarray) – Input data

  • matching_centroids (DNDarray) – Array of indices \(i\) indicating the cluster \(c_i\) to which each sample point in x is assigned
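
The update itself is the standard Lloyd step: each centroid becomes the mean of its assigned samples. A simplified sketch of the rule (a hypothetical helper, not the internal distributed implementation):

    import heat as ht

    def update_centroids(x, matching_centroids, n_clusters):
        # centroid i becomes the mean of all samples currently assigned to i
        new_centers = ht.zeros((n_clusters, x.shape[1]))
        for i in range(n_clusters):
            # indicator column: 1.0 for members of cluster i, else 0.0
            selection = ht.reshape((matching_centroids == i).astype(ht.float32),
                                   (x.shape[0], 1))
            count = ht.sum(selection)
            # zero out non-members, then average over the sample axis
            new_centers[i, :] = ht.sum(x * selection, axis=0) / count
        return new_centers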

fit(x: heat.core.dndarray.DNDarray) → self

Computes the centroids of a k-means clustering.

Parameters:

x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
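
Because fit returns the estimator itself, calls can be chained. A short hypothetical example, where x_train stands in for any appropriately shaped DNDarray:

    labels = ht.cluster.KMeans(n_clusters=3).fit(x_train).predict(x_train)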

_initialize_cluster_centers(x: heat.core.dndarray.DNDarray)

Initializes the K-Means centroids.

Parameters:

x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
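
For intuition, the ‘k-means++’ strategy [2] samples each new center with probability proportional to the squared distance to the nearest center already chosen. A simplified, non-distributed illustration (the data is gathered to NumPy purely for readability; Heat's internal routine operates on distributed data instead):

    import numpy as np
    import heat as ht

    def kmeanspp_init(x, n_clusters, seed=0):
        rng = np.random.default_rng(seed)
        data = x.numpy()  # gather for illustration only
        centers = [data[rng.integers(len(data))]]
        for _ in range(n_clusters - 1):
            # squared distance of each point to its nearest chosen center
            d2 = ((data[:, None, :] - np.asarray(centers)) ** 2).sum(-1).min(1)
            # sample the next center proportionally to that distance
            centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
        return ht.array(np.asarray(centers))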

_assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)

Assigns the passed data points to the centroids based on the respective distance metric.

Parameters:
  • x (DNDarray) – Data points, Shape = (n_samples, n_features)

  • eval_functional_value (bool, default: False) – If True, the current K-Clustering functional value of the clustering algorithm is evaluated
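
Conceptually, the assignment step reduces to a pairwise distance computation followed by an argmin over the cluster axis. A minimal sketch of the idea (a hypothetical helper, not the internal method):

    import heat as ht

    def assign_to_cluster(x, centroids):
        # pairwise Euclidean distances, shape (n_samples, n_clusters)
        distances = ht.spatial.cdist(x, centroids)
        # index of the nearest centroid for every sample
        return ht.argmin(distances, axis=1)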

predict(x: heat.core.dndarray.DNDarray)

Predict the closest cluster each sample in x belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:

x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
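
A short hypothetical example; x_train and x_new stand in for any appropriately shaped DNDarrays:

    model = ht.cluster.KMeans(n_clusters=4, init="random").fit(x_train)
    new_labels = model.predict(x_new)  # one centroid index per row of x_new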