:mod:`heat.cluster.batchparallelclustering`
===========================================
.. py:module:: heat.cluster.batchparallelclustering

.. autoapi-nested-parse::

   Module implementing some clustering algorithms that work in parallel on batches of data.


Module Contents
---------------


.. data:: self
   

   Auxiliary single-process functions and base class for batch-parallel k-clustering

.. function:: _initialize_plus_plus(X, n_clusters, p, random_state=None, weights: torch.tensor = 1, max_samples=2**24 - 1)

   Auxiliary function: single-process k-means++/k-medians++ initialization in PyTorch.

   ``p`` is the norm used for computing distances.
   ``weights`` reweights the sampling distribution so that data points with higher weights are preferred;
   note that ``weights`` must have the same dimension as ``X[0]``.
   The default ``max_samples=2**24 - 1`` is required because PyTorch's ``multinomial`` currently
   supports at most this number of categories.
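
   The seeding idea can be illustrated without PyTorch. The following is a minimal,
   dependency-free sketch and not Heat's implementation: ``p_distance`` and ``plus_plus_init``
   are hypothetical names, one weight per data point is assumed, and the sampling probability
   is assumed to be proportional to the weighted p-th power of the distance to the nearest
   chosen center (the classical k-means++ rule for ``p=2``).

   .. code-block:: python

      import random


      def p_distance(a, b, p):
          """p-norm distance between two equal-length sequences (p >= 1)."""
          return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


      def plus_plus_init(points, n_clusters, p=2, seed=None, weights=None):
          """Pick n_clusters initial centers from points, k-means++ style."""
          rng = random.Random(seed)
          if weights is None:
              weights = [1.0] * len(points)  # assumption: one weight per point
          # First center: drawn according to the weights alone.
          centers = [rng.choices(points, weights=weights, k=1)[0]]
          while len(centers) < n_clusters:
              # Score = weight * (distance to nearest chosen center) ** p.
              scores = [
                  w * min(p_distance(x, c, p) for c in centers) ** p
                  for x, w in zip(points, weights)
              ]
              if sum(scores) == 0:  # every point coincides with a center
                  scores = weights
              centers.append(rng.choices(points, weights=scores, k=1)[0])
          return centers

   A point that is already a center gets score zero, so it is not drawn again unless
   every score is zero.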


.. function:: _kmex(X, p, n_clusters, init, max_iter, tol, random_state=None, weights: torch.tensor = 1.0)

   Auxiliary function: single-process k-means and k-medians in PyTorch ("kmex" stands for k-means and k-medians).

   ``p`` is the norm used for computing distances and should be 1 (k-medians) or 2 (k-means).
   For any other value of ``p``, the function proceeds as for ``p=2``, with no guarantee on the result.
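
   The relation between ``p`` and the update rule can be sketched in plain Python. This is
   an illustration under stated assumptions, not the PyTorch code of this module:
   ``kmex_sketch`` is a hypothetical name, initial centers are passed in explicitly, weights
   are omitted, and convergence is checked on the largest center shift rather than on the
   inertia.

   .. code-block:: python

      import statistics


      def p_distance(a, b, p):
          """p-norm distance between two equal-length sequences (p >= 1)."""
          return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


      def kmex_sketch(points, centers, p=2, max_iter=300, tol=1e-4):
          """Lloyd-style iteration: median update for p=1, mean update otherwise."""
          # Any p other than 1 is treated like p=2 in the update step.
          update = statistics.median if p == 1 else statistics.fmean
          centers = [tuple(c) for c in centers]
          for n_iter in range(1, max_iter + 1):
              # Assignment step: index of the nearest center for every point.
              labels = [
                  min(range(len(centers)), key=lambda k: p_distance(x, centers[k], p))
                  for x in points
              ]
              # Update step: coordinate-wise mean or median of each cluster.
              new_centers = []
              for k, old in enumerate(centers):
                  members = [x for x, lab in zip(points, labels) if lab == k]
                  if members:
                      new_centers.append(tuple(update(col) for col in zip(*members)))
                  else:
                      new_centers.append(old)  # keep the center of an empty cluster
              shift = max(p_distance(a, b, p) for a, b in zip(centers, new_centers))
              centers = new_centers
              if shift <= tol:
                  break
          return centers, labels, n_iter

   With the one-dimensional points ``(0.0,)``, ``(1.0,)``, ``(100.0,)`` and a single cluster,
   ``p=1`` yields the median ``1.0`` as center, whereas ``p=2`` yields the mean, which the
   outlier pulls far to the right.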


.. function:: _parallel_batched_kmex_predict(X, centers, p)

   Auxiliary function: predict labels for parallel_batched_kmex
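
   Label prediction amounts to a nearest-center search in the chosen norm. A minimal sketch,
   assuming plain Python sequences instead of tensors (``predict_labels`` is a hypothetical
   name, not part of Heat):

   .. code-block:: python

      def p_distance(a, b, p):
          """p-norm distance between two equal-length sequences (p >= 1)."""
          return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


      def predict_labels(points, centers, p=2):
          """Return, for each point, the index of its nearest center."""
          return [
              min(range(len(centers)), key=lambda k: p_distance(x, centers[k], p))
              for x in points
          ]

   The choice of ``p`` can change the result: for the point ``(0, 0)`` and the centers
   ``(3, 3)`` and ``(5, 0)``, the second center is nearer in the 1-norm (5 versus 6), while
   the first is nearer in the 2-norm (about 4.24 versus 5).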


.. py:class:: _BatchParallelKCluster(p: int, n_clusters: int, init: str, max_iter: int, tol: float, random_state: Union[int, None], n_procs_to_merge: Union[int, None])

   Bases: :class:`heat.ClusteringMixin`, :class:`heat.BaseEstimator`

   Base class for batch parallel k-clustering


   .. attribute:: n_clusters
      

   .. attribute:: _init
      

   .. attribute:: max_iter
      

   .. attribute:: tol
      

   .. attribute:: random_state
      

   .. attribute:: n_procs_to_merge
      

   .. attribute:: _p
      

   .. attribute:: _cluster_centers
      :annotation: = None

      
   .. attribute:: _n_iter
      :annotation: = None

      
   .. attribute:: _functional_value
      :annotation: = None

      
   .. role:: raw-html(raw)
      :format: html
   .. method:: fit(x: heat.core.dndarray.DNDarray)

      Computes the centroids of the clustering algorithm to fit the data ``x``.

      :param x: Training instances to cluster. Shape = (n_samples, n_features). ``x.split`` must be 0.
      :type x: DNDarray
      :param weights: Add weights to the distribution function used in the clustering algorithm in kmex
      :type weights: torch.tensor


   .. method:: predict(x: heat.core.dndarray.DNDarray)

      Predict the closest cluster each sample in ``x`` belongs to.

      In the vector quantization literature, :func:`cluster_centers_` is called the code book and each value returned by
      predict is the index of the closest code in the code book.

      :param x: New data to predict. Shape = (n_samples, n_features)
      :type x: DNDarray


.. py:class:: BatchParallelKMeans(n_clusters: int = 8, init: str = 'k-means++', max_iter: int = 300, tol: float = 0.0001, random_state: int = None, n_procs_to_merge: int = None)

   Bases: :class:`_BatchParallelKCluster`

   Batch-parallel K-Means clustering algorithm from Ref. [1].
   The input must be a ``DNDarray`` of shape ``(n_samples, n_features)`` with ``split=0`` (i.e. split along the sample axis).
   This method performs K-Means clustering on each batch (i.e. on each process-local chunk) of data individually and in parallel.
   After that, all centroids from the local K-Means runs are gathered and another instance of K-Means is performed on them to determine the final centroids.
   To improve scalability on a large number of processes, this procedure can be applied in a hierarchical manner using the parameter ``n_procs_to_merge``.
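
   The two-stage scheme described above can be sketched on a single process, with a list of
   batches standing in for the process-local chunks. This is an illustration only:
   ``kmeans`` and ``batch_parallel_kmeans`` are hypothetical helpers, the first ``k`` points
   serve as a naive initialization instead of k-means++, and a fixed number of iterations
   replaces the ``tol`` criterion.

   .. code-block:: python

      from statistics import fmean


      def sq_dist(a, b):
          """Squared Euclidean distance between two points."""
          return sum((x - y) ** 2 for x, y in zip(a, b))


      def kmeans(points, k, n_iter=20):
          """Plain k-means with a deliberately naive init (first k points)."""
          centers = [tuple(pt) for pt in points[:k]]
          for _ in range(n_iter):
              groups = [[] for _ in centers]
              for pt in points:
                  nearest = min(range(k), key=lambda j: sq_dist(pt, centers[j]))
                  groups[nearest].append(pt)
              # Mean of each non-empty group; empty groups keep their center.
              centers = [
                  tuple(fmean(col) for col in zip(*g)) if g else c
                  for g, c in zip(groups, centers)
              ]
          return centers


      def batch_parallel_kmeans(batches, k):
          """Cluster every batch on its own, then cluster the gathered centroids."""
          # In a distributed run, each batch would be handled by its own process.
          local = [kmeans(batch, k) for batch in batches]
          # Stand-in for gathering the local centroids from all processes.
          gathered = [c for centers in local for c in centers]
          return kmeans(gathered, k)

   Only ``k`` centroids per batch enter the second stage, so its cost depends on the number
   of batches rather than on the number of samples.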

   :ivar n_clusters: The number of clusters to form as well as the number of centroids to generate.
   :vartype n_clusters: int
   :ivar init: Method for initialization for local and global k-means:

               - ``'k-means++'``: selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
               - ``'random'``: chooses k observations (rows) at random from the data as initial centroids (not implemented yet).
   :vartype init: str
   :ivar max_iter: Maximum number of iterations of the local/global k-means algorithms.
   :vartype max_iter: int
   :ivar tol: Relative tolerance with regard to inertia to declare convergence, both for local and global k-means.
   :vartype tol: float
   :ivar random_state: Determines random number generation for centroid initialization.
   :vartype random_state: int
   :ivar n_procs_to_merge: Number of processes to merge after each iteration of the local k-means. If None, all processes are merged after each iteration.
   :vartype n_procs_to_merge: int

   .. rubric:: References

   [1] Rasim M. Alguliyev, Ramiz M. Aliguliyev, Lyudmila V. Sukhostat, Parallel batch k-means for Big data clustering, Computers & Industrial Engineering, Volume 152 (2021). https://doi.org/10.1016/j.cie.2020.107023.


   .. attribute:: init
      :annotation: = 'k-means++'

      

.. py:class:: BatchParallelKMedians(n_clusters: int = 8, init: str = 'k-medians++', max_iter: int = 300, tol: float = 0.0001, random_state: int = None, n_procs_to_merge: int = None)

   Bases: :class:`_BatchParallelKCluster`

   Batch-parallel K-Medians clustering algorithm, in analogy to the K-Means algorithm from Ref. [1].
   The input must be a ``DNDarray`` of shape ``(n_samples, n_features)`` with ``split=0`` (i.e. split along the sample axis).
   This method performs the classical K-Medians on each batch (i.e. on each process-local chunk) of data individually and in parallel.
   After that, all centroids from the local K-Medians runs are gathered and another instance of K-Medians is performed on them to determine the final centroids.
   To improve scalability on a large number of processes, this procedure can be applied in a hierarchical manner using the parameter ``n_procs_to_merge``.
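
   One possible reading of the hierarchical variant is sketched below: centroid sets of
   ``n_procs_to_merge`` neighbouring processes are pooled and re-clustered, and this repeats
   until a single set remains. This is an assumption about the merging order, not a
   description of Heat's communication pattern; ``kmedians`` and ``hierarchical_merge`` are
   hypothetical helpers with a naive initialization and a fixed iteration count.

   .. code-block:: python

      from statistics import median


      def l1_dist(a, b):
          """1-norm (Manhattan) distance between two points."""
          return sum(abs(x - y) for x, y in zip(a, b))


      def kmedians(points, k, n_iter=20):
          """Plain k-medians with a deliberately naive init (first k points)."""
          centers = [tuple(pt) for pt in points[:k]]
          for _ in range(n_iter):
              groups = [[] for _ in centers]
              for pt in points:
                  nearest = min(range(k), key=lambda j: l1_dist(pt, centers[j]))
                  groups[nearest].append(pt)
              # Coordinate-wise median of each non-empty group.
              centers = [
                  tuple(median(col) for col in zip(*g)) if g else c
                  for g, c in zip(groups, centers)
              ]
          return centers


      def hierarchical_merge(local_centers, k, n_procs_to_merge=None):
          """Merge per-process centroid sets group by group until one set is left."""
          if n_procs_to_merge is not None and n_procs_to_merge < 2:
              raise ValueError("n_procs_to_merge must be at least 2 or None")
          level = list(local_centers)
          step = n_procs_to_merge or len(level)  # None: merge everything at once
          while len(level) > 1:
              merged = []
              for start in range(0, len(level), step):
                  group = level[start:start + step]
                  pooled = [c for centers in group for c in centers]
                  merged.append(kmedians(pooled, k))
              level = merged
          return level[0]

   With four centroid sets and ``n_procs_to_merge=2``, two rounds are needed (4 to 2 to 1
   sets), and each round clusters at most ``2 * k`` points per group.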

   :ivar n_clusters: The number of clusters to form as well as the number of centroids to generate.
   :vartype n_clusters: int
   :ivar init: Method for initialization for local and global k-medians:

               - ``'k-medians++'``: selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
               - ``'random'``: chooses k observations (rows) at random from the data as initial centroids (not implemented yet).
   :vartype init: str
   :ivar max_iter: Maximum number of iterations of the local/global k-medians algorithms.
   :vartype max_iter: int
   :ivar tol: Relative tolerance with regard to inertia to declare convergence, both for local and global k-medians.
   :vartype tol: float
   :ivar random_state: Determines random number generation for centroid initialization.
   :vartype random_state: int
   :ivar n_procs_to_merge: Number of processes to merge after each iteration of the local k-medians. If None, all processes are merged after each iteration.
   :vartype n_procs_to_merge: int

   .. rubric:: References

   [1] Rasim M. Alguliyev, Ramiz M. Aliguliyev, Lyudmila V. Sukhostat, Parallel batch k-means for Big data clustering, Computers & Industrial Engineering, Volume 152 (2021). https://doi.org/10.1016/j.cie.2020.107023.


   .. attribute:: init
      :annotation: = 'k-medians++'

      