Cluster Analysis

This tutorial demonstrates cluster analysis with k-means and k-medians from Heat's cluster module. We use matplotlib to visualize the data and the results.

import heat as ht
import matplotlib.pyplot as plt

Spherical Clouds of Datapoints

For a simple demonstration of the clustering process and the differences between the algorithms, we create an artificial dataset consisting of two disk-shaped clusters centered at \((x_1=2, y_1=2)\) and \((x_2=-2, y_2=-2)\) in 2D space. We sample 100 points inside a disk of radius \(R = 1.0\) by drawing random numbers for the polar coordinates \(( r\in [0,R], \phi \in [0,2\pi])\) and converting them to Cartesian coordinates. Shifting these samples by \(+2\) gives cluster c1 and shifting the same samples by \(-2\) gives cluster c2, so the two clusters are translated copies of each other. The resulting concatenated dataset data has shape \((200, 2)\) and is distributed among the p processes along axis 0 (the sample axis).

num_ele = 100
R = 1.0

# Create a disk-shaped point cloud around the origin
# Sample the radius between 0 and R, and phi between 0 and 2*pi
r = ht.random.rand(num_ele, split=0) * R
phi = ht.random.rand(num_ele, split=0) * 2 * ht.constants.PI

# Transform polar coordinates to Cartesian coordinates
x = r * ht.cos(phi)
y = r * ht.sin(phi)


# Stack the sampled points and shift them to locations (2,2) and (-2, -2)
cluster1 = ht.stack((x + 2, y + 2), axis=1)
cluster2 = ht.stack((x - 2, y - 2), axis=1)

data = ht.concatenate((cluster1, cluster2), axis=0)
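Drawing the radius uniformly from \([0, R]\), as in the code above, places points more densely near the center of the disk, because a thin ring at radius \(r\) has an area proportional to \(r\). This is harmless for the demonstration. If an even density over the disk is wanted, a common alternative is \(r = R\sqrt{u}\) with \(u\) uniform in \([0, 1]\). The following is a minimal sketch in plain Python rather than Heat; the function names `sample_disk` and `fraction_inside` are illustrative and not part of any library.

```python
import math
import random


def sample_disk(n, radius=1.0, uniform_area=True, seed=0):
    """Sample n points inside a disk of the given radius centered at the origin."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u = rng.random()
        # sqrt(u) compensates for the area element r*dr;
        # using u directly concentrates points near the center
        r = radius * (math.sqrt(u) if uniform_area else u)
        phi = 2.0 * math.pi * rng.random()
        points.append((r * math.cos(phi), r * math.sin(phi)))
    return points


def fraction_inside(points, radius):
    """Fraction of points whose distance from the origin is at most radius."""
    return sum(math.hypot(x, y) <= radius for x, y in points) / len(points)


linear = sample_disk(10_000, uniform_area=False)
area = sample_disk(10_000, uniform_area=True)

# The inner disk of radius R/2 covers one quarter of the total area
print(fraction_inside(linear, 0.5))  # close to 0.5
print(fraction_inside(area, 0.5))    # close to 0.25
```

With a uniformly drawn radius, about half of the points fall into the inner quarter of the area; with the square-root transform the share matches the area fraction.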

Let’s plot the data for illustration. To do so with matplotlib, we need to unsplit the data (gather it from all processes) and convert it to a NumPy array. Only rank 0 does the plotting.

data_np = ht.resplit(data, axis=None).numpy()
if ht.MPI_WORLD.rank == 0:
    plt.plot(data_np[:,0], data_np[:,1], 'bo')
    plt.show()

This renders a scatter plot of the two point clouds around \((2, 2)\) and \((-2, -2)\).

Now we perform the cluster analysis with k-means. We choose ‘kmeans++’ as an intelligent way of sampling the initial centroids.

kmeans = ht.cluster.KMeans(n_clusters=2, init="kmeans++")
labels = kmeans.fit_predict(data).squeeze()
centroids = kmeans.cluster_centers_

# Select points assigned to clusters c1 and c2
c1 = data[ht.where(labels == 0), :]
c2 = data[ht.where(labels == 1), :]
# After slicing, the arrays are not distributed equally among the processes anymore; we need to balance
c1.balance_()
c2.balance_()

print(f"Number of points assigned to c1: {c1.shape[0]} \n"
      f"Number of points assigned to c2: {c2.shape[0]} \n"
      f"Centroids = {centroids}")

Number of points assigned to c1: 100
Number of points assigned to c2: 100
Centroids =  DNDarray([[ 2.0169,  2.0713],
                       [-1.9831, -1.9287]], dtype=ht.float32, device=cpu:0, split=None)
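To illustrate the idea behind ‘kmeans++’ seeding: the first centroid is drawn uniformly from the data, and each further centroid is drawn with a probability proportional to the squared distance to the nearest centroid chosen so far. Points far from the existing centroids are therefore preferred. The sketch below is a plain-Python illustration of this general scheme on a small made-up dataset. It is not Heat's implementation, and the names `kmeans_pp_init` and `toy` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids from points using k-means++ style seeding."""
    rng = random.Random(seed)
    # First centroid: uniform random choice among the data points
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Already chosen points have weight 0 and cannot be drawn again
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Made-up toy data: three points near (2, 2) and three near (-2, -2)
toy = [(2.1, 1.9), (1.8, 2.2), (2.0, 2.0), (-2.1, -1.9), (-1.8, -2.2), (-2.0, -2.0)]
seeds = kmeans_pp_init(toy, k=2)
print(seeds)
```

For well-separated clouds like these, the second seed very likely lands in the cloud that does not contain the first one, which gives the subsequent iterations a good starting point.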

Let’s plot the assigned clusters and the respective centroids:

c1_np = c1.numpy()
c2_np = c2.numpy()

if ht.MPI_WORLD.rank == 0:
    plt.plot(c1_np[:,0], c1_np[:,1], 'x', color='#f0781e')
    plt.plot(c2_np[:,0], c2_np[:,1], 'x', color='#5a696e')
    plt.plot(centroids[0,0], centroids[0,1], '^', markersize=10, markeredgecolor='black', color='#f0781e')
    plt.plot(centroids[1,0], centroids[1,1], '^', markersize=10, markeredgecolor='black', color='#5a696e')
    plt.show()
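For readers who want to see what happens during fitting, the classic k-means iteration (Lloyd's algorithm) alternates two steps: assign every point to its nearest centroid, then move every centroid to the mean of its assigned points, until nothing changes. The following is a minimal single-process sketch in plain Python with made-up points and starting centroids. It does not reproduce Heat's distributed implementation, and the helper names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Keep the previous position if a cluster received no points
            updated.append(old)
            continue
        updated.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return updated


def lloyd(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


# Made-up toy data and starting centroids
points = [(2.0, 2.5), (2.5, 2.0), (1.5, 1.5), (-2.0, -2.5), (-2.5, -2.0), (-1.5, -1.5)]
labels, centroids = lloyd(points, [(1.0, 0.0), (-1.0, 0.0)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [(2.0, 2.0), (-2.0, -2.0)]
```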

We can also cluster the data with k-medians. The corresponding advanced initialization of the centroids is called ‘kmedians++’.

kmedians = ht.cluster.KMedians(n_clusters=2, init="kmedians++")
labels = kmedians.fit_predict(data).squeeze()
centroids = kmedians.cluster_centers_

# Select points assigned to clusters c1 and c2
c1 = data[ht.where(labels == 0), :]
c2 = data[ht.where(labels == 1), :]
# After slicing, the arrays are not distributed equally among the processes anymore; we need to balance
c1.balance_()
c2.balance_()

print(f"Number of points assigned to c1: {c1.shape[0]} \n"
      f"Number of points assigned to c2: {c2.shape[0]} \n"
      f"Centroids = {centroids}")

Plotting the assigned clusters and the respective centroids:

c1_np = c1.numpy()
c2_np = c2.numpy()
if ht.MPI_WORLD.rank == 0:
    plt.plot(c1_np[:,0], c1_np[:,1], 'x', color='#f0781e')
    plt.plot(c2_np[:,0], c2_np[:,1], 'x', color='#5a696e')
    plt.plot(centroids[0,0], centroids[0,1], '^', markersize=10, markeredgecolor='black', color='#f0781e')
    plt.plot(centroids[1,0], centroids[1,1], '^', markersize=10, markeredgecolor='black', color='#5a696e')
    plt.show()
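On clean, symmetric clouds like these, k-means and k-medians give nearly the same centroids. The difference shows up with outliers: in the usual formulation, k-medians updates each centroid with the component-wise median of its points instead of the mean, which makes it less sensitive to extreme values. The sketch below illustrates only this update step in plain Python on made-up numbers. The function names are hypothetical, and the details of Heat's `KMedians` are not reproduced here.

```python
from statistics import mean, median


def mean_centroid(points):
    """Component-wise mean of a list of points (k-means style update)."""
    return tuple(mean(coord) for coord in zip(*points))


def median_centroid(points):
    """Component-wise median of a list of points (k-medians style update)."""
    return tuple(median(coord) for coord in zip(*points))


# Made-up cluster: three nearby points and one far outlier
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (100.0, 100.0)]

print(mean_centroid(cluster))    # (26.5, 26.5)
print(median_centroid(cluster))  # (2.5, 2.5)
```

The single outlier drags the mean far away from the three nearby points, while the median stays among them.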

The Iris Dataset

The iris dataset is a well-known example for cluster analysis. It contains 4 measured features for samples from three different types of iris flowers. A subset of 150 samples is included in Heat in the formats h5, csv and netcdf, located under ‘heat/heat/datasets’, and can be loaded in a distributed manner with Heat’s parallel dataloader:

iris = ht.load("heat/datasets/iris.csv", sep=";", split=0)

Fitting the dataset with kmeans:

k = 3
kmeans = ht.cluster.KMeans(n_clusters=k, init="kmeans++")
kmeans.fit(iris)

Let’s see what the results are. In theory, there are 50 samples of each of the 3 iris types:

labels = kmeans.predict(iris).squeeze()

# Select points assigned to clusters c1, c2 and c3
c1 = iris[ht.where(labels == 0), :]
c2 = iris[ht.where(labels == 1), :]
c3 = iris[ht.where(labels == 2), :]
# After slicing, the arrays are not distributed equally among the processes anymore; we need to balance
c1.balance_()
c2.balance_()
c3.balance_()

print(f"Number of points assigned to c1: {c1.shape[0]} \n"
      f"Number of points assigned to c2: {c2.shape[0]} \n"
      f"Number of points assigned to c3: {c3.shape[0]}")