:mod:`heat.cluster.metrics`
===========================
.. py:module:: heat.cluster.metrics

.. autoapi-nested-parse::

   Cluster metrics for the Heat library.


Module Contents
---------------


.. function:: _validate_input(X, labels, metric='euclidean')

   Input validation for clustering metrics. Converts input to DNDarray if needed.

   :param X: Input data.
   :type X: {DNDarray, list}
   :param labels: Labels.
   :type labels: {DNDarray, list}
   :param metric: The metric to use for validation. Default is "euclidean".
   :type metric: str, optional

   :returns: * **X** (*DNDarray*) -- The converted and validated X.
             * **labels** (*DNDarray*) -- The converted and validated labels.

   .. rubric:: Examples

   >>> import heat as ht
   >>> X = ht.array([[1, 2], [3, 4]], dtype=ht.float)
   >>> labels = ht.array([0, 1])
   >>> _validate_input(X, labels)
   (DNDarray([[1., 2.], [3., 4.]], dtype=ht.float32, device=cpu:0, split=None),
    DNDarray([0, 1], dtype=ht.int64, device=cpu:0, split=None))


.. function:: silhouette_samples(X, labels, *, metric='euclidean')

       Compute the Silhouette Coefficient for each sample.

       The Silhouette Coefficient is a measure of how close an object is to its own cluster
       (cohesion) compared to other clusters (separation).
       The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.

       * The score is 0 for a sample that is the only member of its cluster.
       * The calculation involves computing the mean intra-cluster distance (a) and
         the mean nearest-cluster distance (b) for each sample.

       Parameters
       ----------
       X : DNDarray
           An array of pairwise distances between samples, or a feature array.
           If `metric='precomputed'`, X is treated as a distance matrix; otherwise it is treated as a feature array.
       labels : DNDarray
           Labels for each sample.
       metric : str, optional
           The metric to use when calculating distance between instances in a feature array.
           If metric is "precomputed", X is assumed to be a distance matrix.
           Default is "euclidean".

       Returns
       -------
       DNDarray
           Silhouette value of each individual sample in the clustering.

       Notes
       -----
       The Silhouette Coefficient $s(i)$ for a single sample is defined as:
       $$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
       where $a(i)$ is the mean distance to other samples in the same cluster and $b(i)$
       is the mean distance to samples in the nearest neighbor cluster.
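
       The definition above can be sketched in plain NumPy. This is an
       illustrative stand-in, not Heat's implementation: the helper name
       ``silhouette_values`` is hypothetical, the Euclidean metric is hard-coded,
       and the data are assumed to contain at least two clusters and no
       duplicate points across clusters.

       .. code-block:: python

          import numpy as np

          def silhouette_values(X, labels):
              """Hypothetical NumPy helper: per-sample s(i) = (b - a) / max(a, b)."""
              X = np.asarray(X, dtype=float)
              labels = np.asarray(labels)

              # Pairwise Euclidean distances, shape (n, n).
              diff = X[:, None, :] - X[None, :, :]
              D = np.sqrt((diff ** 2).sum(axis=-1))

              _, inverse, counts = np.unique(
                  labels, return_inverse=True, return_counts=True
              )
              n = X.shape[0]
              a = np.zeros(n)
              b = np.full(n, np.inf)

              for k, size in enumerate(counts):
                  in_c = inverse == k
                  # Summed distance from every sample to the members of cluster k.
                  sums = D[:, in_c].sum(axis=1)
                  # a(i): mean distance to the other members of the own cluster
                  # (the self-distance is 0, so divide by size - 1).
                  if size > 1:
                      a[in_c] = sums[in_c] / (size - 1)
                  # b(i): smallest mean distance to a cluster the sample is not in.
                  b[~in_c] = np.minimum(b[~in_c], sums[~in_c] / size)

              s = (b - a) / np.maximum(a, b)
              # Samples that are alone in their cluster get a score of 0.
              s[counts[inverse] == 1] = 0.0
              return s

          X = np.array([[1, 2], [1, 1], [4, 4], [4, 5]])
          labels = np.array([0, 0, 1, 1])
          s = silhouette_values(X, labels)

       For this four-point data set the sketch yields values that round to
       0.7452 and 0.7836, in line with the example output listed for this
       function.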

       Raises
       ------
       ValueError
           If `metric='precomputed'` and the diagonal contains non-zero elements.
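
       As a sketch of what this condition means, the snippet below builds a
       distance matrix with NumPy and checks its diagonal. How Heat performs
       the check internally is not stated here; the helper name
       ``has_zero_diagonal`` and the tolerance ``atol`` are assumptions made
       for illustration.

       .. code-block:: python

          import numpy as np

          def has_zero_diagonal(D, atol=1e-12):
              """Hypothetical check: every self-distance is (numerically) zero."""
              return bool(np.allclose(np.diag(D), 0.0, atol=atol))

          X = np.array([[1, 2], [1, 1], [4, 4], [4, 5]], dtype=float)

          # Pairwise Euclidean distances; the diagonal holds self-distances.
          diff = X[:, None, :] - X[None, :, :]
          D = np.sqrt((diff ** 2).sum(axis=-1))
          diagonal_ok = has_zero_diagonal(D)

          # Adding the identity puts 1.0 on the diagonal, which is not a valid
          # distance matrix and is the kind of input this error guards against.
          diagonal_bad = has_zero_diagonal(D + np.eye(len(D)))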

       See Also
       --------
       silhouette_score : Average silhouette coefficient over all samples.

       Examples
       --------
       >>> import heat as ht
       >>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
       >>> labels = ht.array([0, 0, 1, 1], split=0)
       >>> ht.cluster.silhouette_samples(X, labels)
       DNDarray([0.7452, 0.7836, 0.7452, 0.7836], dtype=ht.float64, device=cpu:0, split=0)


.. function:: silhouette_score(X, labels, *, metric='euclidean', sample_size=None, random_state=None, **kwargs)

   Compute the mean Silhouette Coefficient of all samples.

   The Silhouette Coefficient is calculated using the mean intra-cluster distance (a)
   and the mean nearest-cluster distance (b) for each sample. The Silhouette
   Coefficient for a sample is $(b - a) / \max(a, b)$.

   * This function returns the average of `silhouette_samples`.
   * Here $b$ is the mean distance between a sample and the members of the
     nearest cluster that the sample does not belong to.
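
   The two points above can be illustrated with a small hand-sized example in
   NumPy. The one-dimensional toy values and variable names are made up for
   illustration; the per-sample values in the last step are the ones listed in
   the ``silhouette_samples`` example.

   .. code-block:: python

      import numpy as np

      # One sample at 0.0, one other member in its own cluster, two other clusters.
      sample = 0.0
      own_others = np.array([2.0])
      cluster_b = np.array([4.0, 6.0])
      cluster_c = np.array([10.0, 12.0])

      a = np.abs(own_others - sample).mean()        # mean intra-cluster distance: 2.0
      mean_to_b = np.abs(cluster_b - sample).mean()  # 5.0
      mean_to_c = np.abs(cluster_c - sample).mean()  # 11.0
      b = min(mean_to_b, mean_to_c)                  # nearest other cluster wins: 5.0
      s = (b - a) / max(a, b)                        # (5 - 2) / 5 = 0.6

      # The score is the plain average of the per-sample values.
      per_sample = np.array([0.7452, 0.7836, 0.7452, 0.7836])
      score = per_sample.mean()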

   :param X: An array of pairwise distances between samples, or a feature array.
   :type X: DNDarray
   :param labels: Labels for each sample.
   :type labels: DNDarray
   :param metric: The metric to use when calculating distance between instances in a feature array.
                  If metric is "precomputed", X is assumed to be a distance matrix.
                  Default is "euclidean".
   :type metric: str, optional
   :param sample_size: The size of the sample to use when computing the Silhouette Coefficient on a random subset of the data.
                       If ``sample_size is None``, no sampling is used.
   :type sample_size: int, optional
   :param random_state: Determines random number generation for selecting a subset of samples.
                        Used when `sample_size` is not `None`.
   :type random_state: int, optional
   :param \*\*kwargs: Additional keyword arguments passed to `silhouette_samples`.
   :type \*\*kwargs: optional

   :returns: Silhouette score of the clustering
   :rtype: float
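
   The effect of ``sample_size`` and ``random_state`` can be pictured with the
   NumPy sketch below. It assumes the subset is drawn without replacement from
   a seeded permutation; the helper name ``draw_subset`` is hypothetical and
   the selection scheme actually used by Heat may differ.

   .. code-block:: python

      import numpy as np

      def draw_subset(X, labels, sample_size, random_state=None):
          """Hypothetical helper: pick `sample_size` rows without replacement."""
          rng = np.random.default_rng(random_state)
          idx = rng.permutation(X.shape[0])[:sample_size]
          return X[idx], labels[idx]

      X = np.array([[1, 2], [1, 1], [4, 4], [4, 5]], dtype=float)
      labels = np.array([0, 0, 1, 1])

      # The same seed selects the same rows, so the result is reproducible.
      X_sub, labels_sub = draw_subset(X, labels, sample_size=3, random_state=0)
      X_again, labels_again = draw_subset(X, labels, sample_size=3, random_state=0)

   For a precomputed distance matrix, both the rows and the columns would need
   to be restricted to the selected indices before scoring.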

   .. rubric:: Notes

   The best value is 1 and the worst value is -1. Values near 0 indicate
   overlapping clusters.

   .. seealso::

      :py:obj:`silhouette_samples`
          Silhouette Coefficient for each individual sample.

   .. rubric:: Examples

   >>> import heat as ht
   >>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
   >>> labels = ht.array([0, 0, 1, 1], split=0)
   >>> ht.cluster.silhouette_score(X, labels)
   0.76439