heat.cluster.metrics
Cluster metrics for the Heat library.
Module Contents
- _validate_input(X, labels, metric='euclidean')
Input validation for clustering metrics. Converts input to DNDarray if needed.
- Parameters:
X ({DNDarray, list}) – Input data.
labels ({DNDarray, list}) – Labels.
metric (str, optional) – The metric to use for validation. Default is “euclidean”.
- Returns:
X (DNDarray) – The converted and validated X.
labels (DNDarray) – The converted and validated labels.
Examples
>>> import heat as ht
>>> X = ht.array([[1, 2], [3, 4]], dtype=ht.float)
>>> labels = ht.array([0, 1])
>>> _validate_input(X, labels)
(DNDarray([[1., 2.], [3., 4.]], dtype=ht.float32, device=cpu:0, split=None), DNDarray([0, 1], dtype=ht.int64, device=cpu:0, split=None))
- silhouette_samples(X, labels, *, metric='euclidean')
Compute the Silhouette Coefficient for each sample.
The Silhouette Coefficient is a measure of how close an object is to its own cluster (cohesion) compared to other clusters (separation). The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.
The score is 0 for clusters with only a single sample.
The calculation involves computing the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.
- Parameters:
X (DNDarray) – An array of pairwise distances between samples, or a feature array. If metric='precomputed', X is assumed to be a distance matrix; otherwise it is treated as a feature array.
labels (DNDarray) – Labels for each sample.
metric (str, optional) – The metric to use when calculating distance between instances in a feature array. If metric is “precomputed”, X is assumed to be a distance matrix. Default is “euclidean”.
- Returns:
Silhouette value of each individual sample in the clustering.
- Return type:
DNDarray
Notes
The Silhouette Coefficient $s(i)$ for a single sample is defined as:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
where $a(i)$ is the mean distance to other samples in the same cluster and $b(i)$ is the mean distance to samples in the nearest neighbor cluster.
- Raises:
ValueError – If metric='precomputed' and the diagonal of X contains non-zero elements.
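The diagonal requirement can be illustrated with a small stand-alone check. This is a sketch in plain Python, not Heat's implementation: the helper name check_precomputed_diagonal and the tolerance tol are hypothetical, and nested lists stand in for a DNDarray.

```python
def check_precomputed_diagonal(dist, tol=1e-12):
    """Raise ValueError if a square distance matrix has a non-zero diagonal.

    Hypothetical helper for illustration only. `tol` is an assumed
    tolerance for floating-point noise on the diagonal.
    """
    for i, row in enumerate(dist):
        # The distance from a sample to itself must be zero.
        if abs(row[i]) > tol:
            raise ValueError(
                f"precomputed distance matrix has non-zero diagonal entry at index {i}"
            )


# A valid distance matrix passes silently.
check_precomputed_diagonal([[0.0, 1.0], [1.0, 0.0]])
```

A non-zero self-distance usually means that X is a feature array or a similarity matrix passed with metric='precomputed' by mistake, which is why rejecting it early is useful.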
See also
silhouette_score : Average silhouette coefficient over all samples.
Examples
>>> import heat as ht
>>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
>>> labels = ht.array([0, 0, 1, 1], split=0)
>>> ht.cluster.silhouette_samples(X, labels)
DNDarray([0.7452, 0.7836, 0.7452, 0.7836], dtype=ht.float64, device=cpu:0, split=0)
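To make the definition of $a(i)$ and $b(i)$ concrete, the per-sample computation can be sketched in plain Python. This is an illustrative reference, not Heat's distributed implementation: the function name silhouette_samples_reference is hypothetical, nested lists stand in for DNDarrays, only the Euclidean metric is handled, and at least two distinct labels are assumed.

```python
import math


def silhouette_samples_reference(points, labels):
    """Per-sample silhouette values for Euclidean feature vectors.

    Hypothetical reference helper; assumes at least two distinct labels.
    """
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Accumulate distances from sample i to all other samples, per cluster.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, q in enumerate(points):
            if j == i:
                continue
            sums[labels[j]] += math.dist(p, q)
            counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            # A cluster with a single sample gets a score of 0.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # nearest other cluster
        denom = max(a, b)
        # Guard against identical points, where both a and b are 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


values = silhouette_samples_reference([[1, 2], [1, 1], [4, 4], [4, 5]], [0, 0, 1, 1])
print([round(v, 4) for v in values])
```

For the four points used in the example, the first sample has $a = 1$ and $b = (\sqrt{13} + \sqrt{18}) / 2 \approx 3.9241$, which gives $s \approx 0.7452$.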
- silhouette_score(X, labels, *, metric='euclidean', sample_size=None, random_state=None, **kwargs)
Compute the mean Silhouette Coefficient of all samples.
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is $(b - a) / \max(a, b)$.
This function returns the mean of the per-sample values computed by silhouette_samples.
To clarify, $b$ is the mean distance between a sample and the members of the nearest cluster that the sample is not a part of.
- Parameters:
X (DNDarray) – An array of pairwise distances between samples, or a feature array.
labels (DNDarray) – Labels for each sample.
metric (str, optional) – The metric to use when calculating distance between instances in a feature array. If metric is “precomputed”, X is assumed to be a distance matrix. Default is “euclidean”.
sample_size (int, optional) – The size of the sample to use when computing the Silhouette Coefficient on a random subset of the data. If sample_size is None, no sampling is used.
random_state (int, optional) – Determines random number generation for selecting a subset of samples. Used when sample_size is not None.
**kwargs (optional) – Additional keyword arguments passed to silhouette_samples.
- Returns:
Silhouette score of the clustering
- Return type:
float
Notes
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters.
See also
silhouette_samples : Silhouette Coefficient for each individual sample.
Examples
>>> import heat as ht
>>> X = ht.array([[1, 2], [1, 1], [4, 4], [4, 5]], split=0)
>>> labels = ht.array([0, 0, 1, 1], split=0)
>>> ht.cluster.silhouette_score(X, labels)
0.76439
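The relationship between the score, the per-sample values, and the sample_size / random_state parameters can be sketched in plain Python. This is an assumption-laden illustration rather than Heat's code: the names silhouette_score_reference and _silhouette_values are hypothetical, the subset is drawn without replacement using the standard library's random module, and the drawn subset is assumed to contain at least two distinct labels.

```python
import math
import random
from statistics import fmean


def _silhouette_values(points, labels):
    """Per-sample silhouette values (Euclidean); hypothetical helper."""
    values = []
    for i, p in enumerate(points):
        # Group distances from sample i to every other sample by cluster label.
        dists = {}
        for j, q in enumerate(points):
            if j != i:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if own is None:
            values.append(0.0)  # single-sample cluster
            continue
        a = fmean(own)
        b = min(fmean(d) for d in dists.values())
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_score_reference(points, labels, sample_size=None, random_state=None):
    """Mean silhouette value, optionally computed on a random subset."""
    if sample_size is not None:
        # Assumed sampling scheme: a seeded draw without replacement.
        rng = random.Random(random_state)
        idx = rng.sample(range(len(points)), sample_size)
        points = [points[k] for k in idx]
        labels = [labels[k] for k in idx]
    return fmean(_silhouette_values(points, labels))


X = [[1, 2], [1, 1], [4, 4], [4, 5]]
labels = [0, 0, 1, 1]
print(round(silhouette_score_reference(X, labels), 5))
```

Subsampling trades accuracy for cost: the pairwise distance work grows quadratically with the number of samples, so a fixed sample_size caps it, while random_state keeps the chosen subset reproducible.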