heat.cluster.metrics

Cluster metrics for the Heat library.

Module Contents

_validate_input(X, labels, metric='euclidean')

Input validation for clustering metrics. Converts input to DNDarray if needed.

Parameters:
  • X ({DNDarray, list}) – Input data.

  • labels ({DNDarray, list}) – Labels.

  • metric (str, optional) – The metric to use for validation. Default is “euclidean”.

Returns:

  • X (DNDarray) – The converted and validated X.

  • labels (DNDarray) – The converted and validated labels.

Examples

>>> import heat as ht
>>> X = ht.array([[1, 2], [3, 4]], dtype=ht.float)
>>> labels = ht.array([0, 1])
>>> _validate_input(X, labels)
(DNDarray([[1., 2.], [3., 4.]], dtype=ht.float32, device=cpu:0, split=None),
 DNDarray([0, 1], dtype=ht.int64, device=cpu:0, split=None))

silhouette_samples(X, labels, *, metric='euclidean')

Compute the Silhouette Coefficient for each sample.

The Silhouette Coefficient is a measure of how close an object is to its own cluster (cohesion) compared to other clusters (separation). The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.

  • The score is 0 for clusters with only a single sample.

  • The calculation involves computing the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.

Parameters:
  • X (DNDarray) – An array of pairwise distances between samples, or a feature array. If metric='precomputed', X is assumed to be a distance matrix; otherwise it is treated as a feature array.

  • labels (DNDarray) – Labels for each sample.

  • metric (str, optional) – The metric to use when calculating distance between instances in a feature array. If metric is “precomputed”, X is assumed to be a distance matrix. Default is “euclidean”.

Returns:

Silhouette value of each individual sample in the clustering.

Return type:

DNDarray

The Silhouette Coefficient $s(i)$ for a single sample is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $a(i)$ is the mean distance to other samples in the same cluster and $b(i)$ is the mean distance to samples in the nearest neighbor cluster.
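
The definition above can be sketched in plain Python. This is an illustrative reference only, not Heat's implementation: the function name `silhouette_samples_sketch` is hypothetical, it works on Python lists with Euclidean distances, it assumes at least two distinct labels, and the zero-denominator guard is an assumption not stated on this page.

```python
import math


def silhouette_samples_sketch(points, labels):
    """Per-sample silhouette values using Euclidean distances (hypothetical helper)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Accumulate distances from sample i to the members of every cluster,
        # excluding sample i itself.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] += math.dist(p, q)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            # A cluster with a single sample gets a score of 0.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # nearest other cluster
        denom = max(a, b)
        # Assumption: return 0 when all relevant distances are zero.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

For the four points `[[1, 2], [1, 1], [4, 4], [4, 5]]` with labels `[0, 0, 1, 1]`, every sample has $a(i) = 1$, and the first sample has $b(i) = (\sqrt{13} + \sqrt{18})/2 \approx 3.924$, which gives $s(i) \approx 0.745$.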

Raises:

ValueError – If metric='precomputed' and the diagonal contains non-zero elements.
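
A minimal sketch of the kind of check this error describes, written for a list-of-lists distance matrix. The helper name `check_precomputed_diagonal` and the tolerance `atol` are assumptions made for illustration; the page does not say whether Heat compares against exact zero or uses a tolerance.

```python
def check_precomputed_diagonal(dist, atol=1e-12):
    """Raise ValueError if a square distance matrix has a non-zero diagonal.

    Hypothetical helper; `atol` is an assumed tolerance, not a documented one.
    """
    for i, row in enumerate(dist):
        # The distance from a sample to itself must be zero.
        if abs(row[i]) > atol:
            raise ValueError(
                f"precomputed distance matrix has non-zero diagonal entry at index {i}"
            )
```

A valid matrix such as `[[0.0, 1.0], [1.0, 0.0]]` passes silently, while `[[0.5, 1.0], [1.0, 0.0]]` raises `ValueError`.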

See also

silhouette_score

Average silhouette coefficient over all samples.

Examples

>>> import heat as ht
>>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
>>> labels = ht.array([0, 0, 1, 1], split=0)
>>> ht.cluster.silhouette_samples(X, labels)
DNDarray([0.7452, 0.7836, 0.7452, 0.7836], dtype=ht.float64, device=cpu:0, split=0)

silhouette_score(X, labels, *, metric='euclidean', sample_size=None, random_state=None, **kwargs)

Compute the mean Silhouette Coefficient of all samples.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is $(b - a) / \max(a, b)$.

  • This function returns the average of silhouette_samples.

  • To clarify, $b$ is the mean distance between a sample and the members of the nearest cluster that the sample does not belong to.

Parameters:
  • X (DNDarray) – An array of pairwise distances between samples, or a feature array.

  • labels (DNDarray) – Labels for each sample.

  • metric (str, optional) – The metric to use when calculating distance between instances in a feature array. If metric is “precomputed”, X is assumed to be a distance matrix. Default is “euclidean”.

  • sample_size (int, optional) – The size of the sample to use when computing the Silhouette Coefficient on a random subset of the data. If sample_size is None, no sampling is used.

  • random_state (int, optional) – Determines random number generation for selecting a subset of samples. Used when sample_size is not None.

  • **kwargs (optional) – Additional keyword arguments passed to silhouette_samples.

Returns:

Silhouette score of the clustering

Return type:

float

Notes

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.
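
The sample_size and random_state parameters describe scoring on a random subset of the data. The selection step might look like the sketch below. It assumes sampling without replacement and uses Python's `random` module; the page does not state how Heat draws the subset, and the helper name `subsample` is hypothetical.

```python
import random


def subsample(points, labels, sample_size=None, random_state=None):
    """Return a random subset of (points, labels); hypothetical helper."""
    if sample_size is None:
        # No sampling requested: use every sample.
        return list(points), list(labels)
    # A seeded generator makes the selection reproducible.
    rng = random.Random(random_state)
    # Assumption: indices are drawn without replacement.
    idx = rng.sample(range(len(points)), sample_size)
    return [points[i] for i in idx], [labels[i] for i in idx]
```

The mean of the per-sample silhouette values over the returned subset then approximates the score on the full data. Calling the helper twice with the same random_state selects the same samples.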

See also

silhouette_samples

Silhouette Coefficient for each individual sample.

Examples

>>> import heat as ht
>>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
>>> labels = ht.array([0, 0, 1, 1], split=0)
>>> ht.cluster.silhouette_score(X, labels)
0.76439