heat.cluster.batchparallelclustering
Module implementing some clustering algorithms that work in parallel on batches of data.
Module Contents
Auxiliary single-process functions and base class for batch-parallel k-clustering
- _initialize_plus_plus(X, n_clusters, p, random_state=None)
Auxiliary function: single-process k-means++/k-medians++ initialization in PyTorch. p is the norm used for computing distances.
- _kmex(X, p, n_clusters, init, max_iter, tol, random_state=None)
Auxiliary function: single-process k-means and k-medians in PyTorch. p is the norm used for computing distances: p=2 yields k-means, p=1 yields k-medians. p should be 1 (k-medians) or 2 (k-means); for any other choice of p, we proceed as for p=2 and hope for the best. (Note: "kmex" stands for k-means and k-medians.)
- _parallel_batched_kmex_predict(X, centers, p)
Auxiliary function: predict labels for parallel_batched_kmex
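The seeding routine behind _initialize_plus_plus can be illustrated with a single-process NumPy sketch of the k-means++/k-medians++ idea (the function name plus_plus_init and all details below are illustrative assumptions, not heat's actual PyTorch implementation):

```python
import numpy as np

def plus_plus_init(X, n_clusters, p=2, random_state=None):
    """Sketch of k-means++ (p=2) / k-medians++ (p=1) seeding on one process."""
    rng = np.random.default_rng(random_state)
    n_samples = X.shape[0]
    # First center: drawn uniformly at random from the data.
    centers = [X[rng.integers(n_samples)]]
    for _ in range(n_clusters - 1):
        # p-norm distance of every sample to its nearest chosen center.
        dists = np.min(
            np.stack([np.linalg.norm(X - c, ord=p, axis=1) for c in centers]),
            axis=0,
        )
        # Draw the next center with probability proportional to dists**p,
        # so samples far from all current centers are favored.
        probs = dists**p / np.sum(dists**p)
        centers.append(X[rng.choice(n_samples, p=probs)])
    return np.stack(centers)
```

Because each new seed is drawn far from the existing ones, this initialization tends to place one seed per cluster, which is what speeds up convergence compared to uniform random seeding.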
- class _BatchParallelKCluster(p: int, n_clusters: int, init: str, max_iter: int, tol: float, random_state: int | None, n_procs_to_merge: int | None)
Bases: heat.ClusteringMixin, heat.BaseEstimator
Base class for batch parallel k-clustering
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features). It must hold x.split=0.
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
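Conceptually, predict is a nearest-centroid lookup under the p-norm. A minimal NumPy sketch of that assignment step (an illustration only; heat's implementation operates on distributed DNDarrays):

```python
import numpy as np

def predict_labels(X, centers, p=2):
    """Index of the nearest center (code-book entry) for each row of X."""
    # Pairwise p-norm distances, shape (n_samples, n_clusters).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], ord=p, axis=2)
    return np.argmin(dists, axis=1)
```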
- class BatchParallelKMeans(n_clusters: int = 8, init: str = 'k-means++', max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, n_procs_to_merge: int | None = None)
Bases: _BatchParallelKCluster
Batch-parallel K-Means clustering algorithm from Ref. [1]. The input must be a DNDarray of shape (n_samples, n_features) with split=0 (i.e., split along the sample axis). This method performs K-Means clustering on each batch (i.e., on each process-local chunk) of the data individually and in parallel. Afterwards, all centroids from the local K-Means runs are gathered and another instance of K-Means is performed on them in order to determine the final centroids. To improve the scalability of this approach on a large number of processes, this procedure can be applied hierarchically using the parameter n_procs_to_merge.
- Variables:
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str) – Method for initialization for local and global k-means: - ‘k-means++’ : selects initial cluster centers for the clustering in a smart way to speed up convergence [2]. - ‘random’: choose k observations (rows) at random from data for the initial centroids. (Not implemented yet)
max_iter (int) – Maximum number of iterations of the local/global k-means algorithms.
tol (float) – Relative tolerance with regard to inertia to declare convergence, both for local and global k-means.
random_state (int) – Determines random number generation for centroid initialization.
n_procs_to_merge (int) – Number of processes to merge after each iteration of the local k-means. If None, all processes are merged after each iteration.
References
[1] Rasim M. Alguliyev, Ramiz M. Aliguliyev, Lyudmila V. Sukhostat, Parallel batch k-means for Big data clustering, Computers & Industrial Engineering, Volume 152 (2021). https://doi.org/10.1016/j.cie.2020.107023.
[2] David Arthur, Sergei Vassilvitskii, k-means++: the advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 2007, pp. 1027–1035.
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features). It must hold x.split=0.
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
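The two-stage scheme described above — local K-Means on each process-local batch, then K-Means on the gathered local centroids — can be mimicked serially. A NumPy sketch under stated assumptions (helper names are hypothetical; heat runs the local stage on separate MPI processes and can merge hierarchically via n_procs_to_merge; the deterministic farthest-point initialization here replaces k-means++ purely for reproducibility):

```python
import numpy as np

def lloyd(X, k, n_iter=50):
    """Plain k-means (Lloyd iterations) on a single batch."""
    # Deterministic farthest-point initialization.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min(
            np.stack([np.linalg.norm(X - c, axis=1) for c in centers]), axis=0
        )
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign each sample to its nearest center, then recompute means;
        # an empty cluster keeps its previous center.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1
        )
        centers = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return centers

def batch_parallel_kmeans(batches, k):
    """Stage 1: k-means on every batch; stage 2: k-means on the
    gathered local centroids to obtain the final centroids."""
    local_centroids = np.vstack([lloyd(batch, k) for batch in batches])
    return lloyd(local_centroids, k)
```

In heat itself, the batches are the process-local chunks of a split=0 DNDarray, so stage 1 needs no communication at all; only the small (n_procs × k, n_features) array of local centroids is exchanged.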
- class BatchParallelKMedians(n_clusters: int = 8, init: str = 'k-medians++', max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, n_procs_to_merge: int | None = None)
Bases: _BatchParallelKCluster
Batch-parallel K-Medians clustering algorithm, analogous to the K-Means algorithm from Ref. [1]. The input must be a DNDarray of shape (n_samples, n_features) with split=0 (i.e., split along the sample axis). The idea of the method is to perform classical K-Medians on each batch of data (i.e., on each process-local chunk) individually and in parallel. Afterwards, all centroids from the local K-Medians runs are gathered and another instance of K-Medians is performed on them in order to determine the final centroids. To improve the scalability of this approach on a large number of processes, this procedure can be applied hierarchically using the parameter n_procs_to_merge.
- Variables:
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str) – Method for initialization for local and global k-medians: - ‘k-medians++’ : selects initial cluster centers for the clustering in a smart way to speed up convergence [2]. - ‘random’: choose k observations (rows) at random from data for the initial centroids. (Not implemented yet)
max_iter (int) – Maximum number of iterations of the local/global k-medians algorithms.
tol (float) – Relative tolerance with regard to inertia to declare convergence, both for local and global k-medians.
random_state (int) – Determines random number generation for centroid initialization.
n_procs_to_merge (int) – Number of processes to merge after each iteration of the local k-medians. If None, all processes are merged after each iteration.
References
[1] Rasim M. Alguliyev, Ramiz M. Aliguliyev, Lyudmila V. Sukhostat, Parallel batch k-means for Big data clustering, Computers & Industrial Engineering, Volume 152 (2021). https://doi.org/10.1016/j.cie.2020.107023.
[2] David Arthur, Sergei Vassilvitskii, k-means++: the advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 2007, pp. 1027–1035.
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features). It must hold x.split=0.
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
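What distinguishes K-Medians (p=1) from K-Means (p=2) is the update rule: samples are assigned via 1-norm distances and each center moves to the component-wise median of its cluster, which makes the method more robust to outliers. A minimal NumPy sketch of one such update (illustrative only; the function name kmedians_step is hypothetical and this is not heat's distributed implementation):

```python
import numpy as np

def kmedians_step(X, centers):
    """One K-Medians update: 1-norm assignment, then component-wise medians."""
    # Assign each sample to the nearest center in the 1-norm.
    labels = np.argmin(
        np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2), axis=1
    )
    # Move each center to the component-wise median of its cluster;
    # an empty cluster keeps its previous center.
    new_centers = np.stack([
        np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers
```

Iterating this step until the change in inertia falls below tol is, in essence, what the local and global K-Medians stages do.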