:mod:`heat.cluster.kmeans`
==========================

.. py:module:: heat.cluster.kmeans

.. autoapi-nested-parse::

   Module Implementing the Kmeans Algorithm


Module Contents
---------------

.. py:class:: KMeans(n_clusters: int = 8, init: Union[str, heat.core.dndarray.DNDarray] = 'random', max_iter: int = 300, tol: float = 0.0001, random_state: Optional[int] = None)

   Bases: :class:`heat.cluster._kcluster._KCluster`

   K-Means clustering algorithm. An implementation of Lloyd's algorithm [1].

   :ivar n_clusters: The number of clusters to form as well as the number of centroids to generate.
   :vartype n_clusters: int
   :ivar init: Method for initialization:

      - 'k-means++': selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
      - 'random': chooses k observations (rows) at random from the data as the initial centroids.
      - 'batchparallel': initializes by using the batch parallel algorithm (see BatchParallelKMeans for more information).
      - DNDarray: gives the initial centers; it should be of shape (n_clusters, n_features).
   :vartype init: str or DNDarray
   :ivar max_iter: Maximum number of iterations of the k-means algorithm for a single run.
   :vartype max_iter: int
   :ivar tol: Relative tolerance with regard to inertia to declare convergence.
   :vartype tol: float
   :ivar random_state: Determines random number generation for centroid initialization.
   :vartype random_state: int

   .. rubric:: Notes

   The average complexity is given by :math:`O(k \cdot n \cdot T)`, where :math:`k` is the number of clusters, :math:`n` is the number of samples and :math:`T` is the number of iterations.

   In practice, the k-means algorithm is very fast, but it may fall into local minima. That is why it can be useful to restart it several times.

   If the algorithm stops before fully converging (because of ``tol`` or ``max_iter``), ``labels_`` and ``cluster_centers_`` will not be consistent, i.e. the ``cluster_centers_`` will not be the means of the points in each cluster.
   Also, the estimator will reassign ``labels_`` after the last iteration to make ``labels_`` consistent with ``predict`` on the training set.

   .. rubric:: References

   [1] Lloyd, Stuart P., "Least squares quantization in PCM", IEEE Transactions on Information Theory, 28 (2), pp. 129–137, 1982.

   [2] Arthur, D., Vassilvitskii, S., "k-means++: The Advantages of Careful Seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 1027–1035, 2007.

   .. attribute:: _p
      :annotation: = 2

   .. role:: raw-html(raw)
      :format: html

   .. method:: _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)

      Compute the coordinates of each new centroid as the mean of the data points in ``x`` that are assigned to that centroid.

      :param x: Input data
      :type x: DNDarray
      :param matching_centroids: Array filled with indices ``i`` indicating to which cluster ``ci`` each sample point in ``x`` is assigned
      :type matching_centroids: DNDarray

   .. method:: fit(x: heat.core.dndarray.DNDarray, oversampling: float = 2, iter_multiplier: float = 1) -> KMeans.fit.self

      Computes the centroids of a k-means clustering. Reduce the values of the parameters ``oversampling`` and ``iter_multiplier`` to speed up the computation, if necessary. However, if the values are too low, the initialization of cluster centers might fail and raise a corresponding ValueError.

      :param x: Training instances to cluster. Shape = (n_samples, n_features)
      :type x: DNDarray
      :param oversampling: Oversampling factor used for the k-means|| initialization of centroids
      :type oversampling: float
      :param iter_multiplier: Factor that increases the number of iterations used in the initialization of centroids
      :type iter_multiplier: float
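
The assign-then-update loop behind Lloyd's algorithm [1], the centroid update described for ``_update_centroids``, and the final relabeling mentioned in the Notes can be illustrated with a minimal pure-Python sketch. This is not Heat's implementation: it works on plain tuples instead of a ``DNDarray``, the function names (``assign_labels``, ``update_centroids``, ``lloyd_kmeans``) are hypothetical, only the 'random' initialization is covered, and the stopping rule (total squared centroid shift at most ``tol``) and the handling of empty clusters are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_labels(points, centroids):
    """Index of the nearest centroid for every point (ties go to the lower index)."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """New centroid = mean of the points assigned to it.

    Assumption: a cluster with no assigned points keeps its old centroid.
    """
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(
                tuple(sum(column) / len(members) for column in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids


def lloyd_kmeans(points, n_clusters, max_iter=300, tol=1e-4, seed=None):
    """Illustrative Lloyd iteration with 'random' initialization."""
    rng = random.Random(seed)
    # 'random' init: pick k observations as the initial centroids.
    centroids = rng.sample(points, n_clusters)
    for _ in range(max_iter):
        labels = assign_labels(points, centroids)
        new_centroids = update_centroids(points, labels, centroids)
        # Assumed stopping rule: total squared movement of all centroids.
        shift = sum(
            squared_distance(old, new) for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= tol:
            break
    # Reassign once more so the labels match the final centroids.
    labels = assign_labels(points, centroids)
    return centroids, labels


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
centroids, labels = lloyd_kmeans(points, n_clusters=2, seed=0)
```

Each pass costs one distance evaluation per point and centroid, which is where the :math:`O(k \cdot n \cdot T)` estimate in the Notes comes from. If the loop exits because ``max_iter`` is reached, the final relabeling can move points between clusters, so the returned centroids are then no longer guaranteed to be the means of their clusters.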