:mod:`heat.cluster.metrics`
===========================

.. py:module:: heat.cluster.metrics

.. autoapi-nested-parse::

   Cluster metrics for the Heat library.


Module Contents
---------------

.. function:: _validate_input(X, labels, metric='euclidean')

   Input validation for clustering metrics. Converts the inputs to DNDarray if needed.

   :param X: Input data.
   :type X: {DNDarray, list}
   :param labels: Labels.
   :type labels: {DNDarray, list}
   :param metric: The metric to use for validation. Default is "euclidean".
   :type metric: str, optional

   :returns: * **X** (*DNDarray*) -- The converted and validated X.
             * **labels** (*DNDarray*) -- The converted and validated labels.

   .. rubric:: Examples

   >>> import heat as ht
   >>> X = ht.array([[1, 2], [3, 4]], dtype=ht.float)
   >>> labels = ht.array([0, 1])
   >>> _validate_input(X, labels)
   (DNDarray([[1., 2.], [3., 4.]], dtype=ht.float32, device=cpu:0, split=None),
    DNDarray([0, 1], dtype=ht.int64, device=cpu:0, split=None))


.. function:: silhouette_samples(X, labels, *, metric='euclidean')

   Compute the Silhouette Coefficient for each sample.

   The Silhouette Coefficient measures how close a sample is to its own cluster (cohesion) compared to other clusters (separation). The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.

   * The score is 0 for clusters with only a single sample.
   * The calculation computes the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.

   :param X: An array of pairwise distances between samples, or a feature array. If ``metric='precomputed'``, X is treated as a distance matrix; otherwise it is treated as a feature array.
   :type X: DNDarray
   :param labels: Labels for each sample.
   :type labels: DNDarray
   :param metric: The metric to use when calculating distance between instances in a feature array. If metric is "precomputed", X is assumed to be a distance matrix. Default is "euclidean".
   :type metric: str, optional

   :returns: Silhouette Coefficient of each individual sample in the clustering.
   :rtype: DNDarray

   :raises ValueError: If ``metric='precomputed'`` and the diagonal of X contains non-zero elements.

   .. rubric:: Notes

   The Silhouette Coefficient :math:`s(i)` for a single sample is defined as:

   .. math::

      s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

   where :math:`a(i)` is the mean distance to the other samples in the same cluster and :math:`b(i)` is the mean distance to the samples in the nearest neighboring cluster.

   .. seealso::

      :py:obj:`silhouette_score`
          Average Silhouette Coefficient over all samples.

   .. rubric:: Examples

   >>> import heat as ht
   >>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
   >>> labels = ht.array([0, 0, 1, 1], split=0)
   >>> ht.cluster.silhouette_samples(X, labels)
   DNDarray([0.7452, 0.7836, 0.7452, 0.7836], dtype=ht.float64, device=cpu:0, split=0)


.. function:: silhouette_score(X, labels, *, metric='euclidean', sample_size=None, random_state=None, **kwargs)

   Compute the mean Silhouette Coefficient of all samples.

   The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is :math:`(b - a) / \max(a, b)`.

   * This function returns the mean of the values computed by :py:obj:`silhouette_samples`.
   * Here :math:`b` is the mean distance between a sample and the samples of the nearest cluster that the sample does not belong to.

   :param X: An array of pairwise distances between samples, or a feature array.
   :type X: DNDarray
   :param labels: Labels for each sample.
   :type labels: DNDarray
   :param metric: The metric to use when calculating distance between instances in a feature array. If metric is "precomputed", X is assumed to be a distance matrix. Default is "euclidean".
   :type metric: str, optional
   :param sample_size: The size of the sample to use when computing the Silhouette Coefficient on a random subset of the data. If ``sample_size is None``, no sampling is used.
   :type sample_size: int, optional
   :param random_state: Determines random number generation for selecting a subset of samples. Used only when ``sample_size`` is not ``None``.
   :type random_state: int, optional
   :param \*\*kwargs: Additional keyword arguments passed to :py:obj:`silhouette_samples`.
   :type \*\*kwargs: optional

   :returns: Silhouette score of the clustering.
   :rtype: float

   .. rubric:: Notes

   The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.

   .. seealso::

      :py:obj:`silhouette_samples`
          Silhouette Coefficient for each individual sample.

   .. rubric:: Examples

   >>> import heat as ht
   >>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
   >>> labels = ht.array([0, 0, 1, 1], split=0)
   >>> ht.cluster.silhouette_score(X, labels)
   0.76439
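
.. rubric:: Worked example

The definitions of the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) given above can be checked by hand with a short plain-Python sketch. This is not Heat's implementation: ``silhouette_samples_reference`` is a hypothetical helper introduced only for illustration. It assumes Euclidean distance, a feature array given as nested lists, and at least two distinct labels, and it runs on a single process without DNDarrays.

.. code-block:: python

   import math


   def silhouette_samples_reference(points, labels):
       """Illustrative reference for s(i) = (b(i) - a(i)) / max(a(i), b(i)).

       Hypothetical helper, not part of Heat. Assumes Euclidean distance
       and at least two distinct labels.
       """
       clusters = sorted(set(labels))
       scores = []
       for i, (p, own) in enumerate(zip(points, labels)):
           # Distances from sample i to every other sample, grouped by cluster.
           dists = {c: [] for c in clusters}
           for j, (q, lab) in enumerate(zip(points, labels)):
               if j != i:
                   dists[lab].append(math.dist(p, q))

           if not dists[own]:
               # Single-sample cluster: the score is defined as 0.
               scores.append(0.0)
               continue

           # a: mean distance to the other samples in the same cluster.
           a = sum(dists[own]) / len(dists[own])
           # b: smallest mean distance to the samples of any other cluster.
           b = min(sum(d) / len(d) for c, d in dists.items() if c != own)

           denom = max(a, b)
           # Guard against coincident points, where both a and b are 0.
           scores.append((b - a) / denom if denom > 0 else 0.0)
       return scores


   X = [[1, 2], [1, 1], [4, 4], [4, 5]]
   labels = [0, 0, 1, 1]
   samples = silhouette_samples_reference(X, labels)
   print([round(s, 4) for s in samples])         # [0.7452, 0.7836, 0.7452, 0.7836]
   print(round(sum(samples) / len(samples), 5))  # 0.76439

The per-sample values and their mean agree with the outputs shown in the ``silhouette_samples`` and ``silhouette_score`` examples for the same data. The nested loop costs a number of distance evaluations that is quadratic in the number of samples, so the sketch is only suited to small inputs.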