heat.cluster._kcluster
Base module for k-clustering algorithms
Module Contents
- class _KCluster(metric: Callable, n_clusters: int, init: str | heat.core.dndarray.DNDarray, max_iter: int, tol: float, random_state: int)[source]
Bases:
heat.ClusteringMixin, heat.BaseEstimator
Base class for k-statistics clustering algorithms (kmeans, kmedians, kmedoids). The clusters are represented by centroids c_i (the term is borrowed from kmeans for simplicity).
- Parameters:
metric (function) – One of the distance metrics in ht.spatial.distance. Must be passed as a lambda function that takes exactly two arrays as input.
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray, default: ‘random’) –
Method for initialization:
‘probability_based’: selects initial cluster centers with distance-weighted probabilistic seeding (k-means++) to speed up convergence.
‘random’: chooses k observations (rows) at random from the data as the initial centroids.
‘batchparallel’: uses the batch-parallel algorithm to initialize the centroids; only available for split=0 and KMeans or KMedians.
DNDarray: gives the initial centers; should have Shape = (n_clusters, n_features).
max_iter (int) – Maximum number of iterations for a single run.
tol (float, default: 1e-4) – Relative tolerance with regard to inertia to declare convergence.
random_state (int) – Determines random number generation for centroid initialization.
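The requirement that metric accept only two arrays can be met by binding any extra options inside a lambda. The following is a minimal pure-Python sketch of that pattern; minkowski_cdist is a hypothetical stand-in for a distance function with additional keyword arguments and is not part of Heat. With Heat itself, the wrapped function would be one of those in ht.spatial.distance.

```python
# Hypothetical stand-in for a distance function that accepts extra options.
def minkowski_cdist(x, y, p=2):
    """Pairwise Minkowski distances between the rows of x and the rows of y."""
    return [
        [sum(abs(a - b) ** p for a, b in zip(row_x, row_y)) ** (1.0 / p) for row_y in y]
        for row_x in x
    ]


# Bind the extra option so the estimator only sees a two-argument callable.
manhattan_metric = lambda x, y: minkowski_cdist(x, y, p=1)

points = [[0.0, 0.0], [3.0, 4.0]]
centers = [[0.0, 0.0]]
distances = manhattan_metric(points, centers)  # one row per point, one column per center
```

The estimator can then call the metric as metric(x, centroids) without knowing which options were fixed.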
- n_clusters
- init
- max_iter
- tol
- random_state
- _metric
- _cluster_centers = None
- _functional_value = None
- _labels = None
- _inertia = None
- _n_iter = None
- _p = None
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray, oversampling: float, iter_multiplier: float)[source]
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
oversampling (float) – oversampling factor used in the k-means|| initialization of centroids
iter_multiplier (float) – factor that increases the number of iterations used in the initialization of centroids
- _centroid_sampling_helper(x: heat.core.dndarray.DNDarray, centroids: heat.core.dndarray.DNDarray, oversampling: float, num_iters: int)[source]
Helper function for the k-means|| initialization of centroids. Samples new centroids based on a probability distribution derived from the distance of data points to the current set of centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
centroids (DNDarray) – The initial set of centroids
oversampling (float) – oversampling factor used in the k-means|| initialization of centroids
num_iters (int) – number of iterations used in the initialization of centroids
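The sampling idea can be illustrated with a small pure-Python sketch. It follows the usual k-means|| scheme, in which each point is drawn independently with probability proportional to its squared distance to the nearest current centroid, scaled by the oversampling factor. This is an assumption about the general technique, not a copy of Heat's implementation; the function names and the seed argument are hypothetical.

```python
import random


def squared_distance_to_nearest(point, centroids):
    """Squared Euclidean distance from point to its closest centroid."""
    return min(sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids)


def sample_centroids(x, centroids, oversampling, num_iters, seed=0):
    """Grow the centroid set over num_iters rounds of distance-weighted sampling."""
    rng = random.Random(seed)  # hypothetical seed argument for reproducibility
    centroids = [list(c) for c in centroids]
    for _ in range(num_iters):
        d2 = [squared_distance_to_nearest(p, centroids) for p in x]
        total = sum(d2)
        if total == 0.0:
            break  # every point already coincides with a centroid
        # All candidates of one round are drawn against the same centroid set.
        new = [
            list(p)
            for p, d in zip(x, d2)
            if rng.random() < min(1.0, oversampling * d / total)
        ]
        centroids.extend(new)
    return centroids


data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
candidates = sample_centroids(data, [data[0]], oversampling=2.0, num_iters=3)
```

Points far from every current centroid are the most likely to be picked, so the candidate set spreads across the data. In k-means||, the candidates are typically reduced to n_clusters centroids in a later step.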
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)[source]
Assigns the passed data points to the centroids based on the respective metric.
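A minimal pure-Python sketch of the assignment step: each point receives the index of the centroid with the smallest metric value. The handling of eval_functional_value here (summing each point's distance to its assigned centroid) is an assumption made for illustration, and euclidean_cdist is a hypothetical helper.

```python
def euclidean_cdist(x, y):
    """Pairwise Euclidean distances between the rows of x and the rows of y."""
    return [
        [sum((a - b) ** 2 for a, b in zip(row_x, row_y)) ** 0.5 for row_y in y]
        for row_x in x
    ]


def assign_to_cluster(x, centroids, metric, eval_functional_value=False):
    """Return the closest-centroid index per point, optionally with the summed distance."""
    distances = metric(x, centroids)
    labels = [min(range(len(row)), key=row.__getitem__) for row in distances]
    if eval_functional_value:
        # Assumed definition: total distance of all points to their assigned centroid.
        return labels, sum(min(row) for row in distances)
    return labels


x = [[0.0, 0.0], [10.0, 10.0], [1.0, 0.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
labels, value = assign_to_cluster(x, centroids, euclidean_cdist, eval_functional_value=True)
```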
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)[source]
The update strategy is algorithm specific (e.g. calculate the mean of the assigned points for kmeans, the median for kmedians, etc.).
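To show how the update differs between algorithms, the sketch below applies a per-feature reducer to the points of each cluster. It is plain Python rather than Heat code: it takes a list of labels instead of the matching_centroids argument above, and keeping the old centroid for an empty cluster is an assumption.

```python
from statistics import mean, median


def update_centroids(x, labels, old_centroids, reducer):
    """Recompute each centroid by reducing its assigned points feature by feature."""
    new_centroids = []
    for k, old in enumerate(old_centroids):
        members = [p for p, label in zip(x, labels) if label == k]
        if not members:
            new_centroids.append(list(old))  # assumption: empty clusters keep their centroid
            continue
        new_centroids.append([reducer(column) for column in zip(*members)])
    return new_centroids


x = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [9.0, 9.0]]
labels = [0, 0, 0, 1]
old = [[1.0, 1.0], [8.0, 8.0]]
kmeans_style = update_centroids(x, labels, old, mean)
kmedians_style = update_centroids(x, labels, old, median)
```

With the outlier at 10.0 in the first cluster, the mean-based centroid moves to 4.0 on the first feature while the median-based one stays at 2.0, which is the usual motivation for kmedians on data with outliers.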
- fit(x: heat.core.dndarray.DNDarray)[source]
Computes the centroids of the clustering algorithm to fit the data x. The full pipeline is algorithm specific.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- predict(x: heat.core.dndarray.DNDarray)[source]
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_() is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
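The code-book reading of predict can be sketched in a few lines of plain Python. The function below is a hypothetical stand-in that uses squared Euclidean distance; it is not Heat's predict, which works on DNDarrays with the estimator's own metric.

```python
def predict(x, code_book):
    """Return, for each sample, the index of the closest code in the code book."""

    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [
        min(range(len(code_book)), key=lambda k: squared_distance(sample, code_book[k]))
        for sample in x
    ]


code_book = [[0.0, 0.0], [5.0, 5.0]]  # plays the role of the fitted cluster centers
samples = [[0.2, -0.1], [4.0, 6.0], [5.1, 4.9]]
codes = predict(samples, code_book)
quantized = [code_book[k] for k in codes]  # each sample replaced by its code
```

Storing codes instead of samples is the compression step of vector quantization; looking the indices up in the code book gives the approximate reconstruction.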