Cluster Analysis
================

This tutorial demonstrates cluster analysis with k-means and k-medians from Heat's ``cluster`` module.
We will use matplotlib for visualization of data and results. ::

    import heat as ht
    import matplotlib.pyplot as plt


Spherical Clouds of Datapoints
------------------------------
For a simple demonstration of the clustering process and the differences between the algorithms, we will create an
artificial dataset consisting of two circular clusters centered at :math:`(x_1=2, y_1=2)` and :math:`(x_2=-2, y_2=-2)` in 2D space.
We sample 100 random points from a circle of radius :math:`R = 1.0` by drawing random numbers
for the polar coordinates :math:`(r \in [0,R], \phi \in [0,2\pi])`, transforming these to cartesian coordinates,
and shifting them by :math:`+2` to obtain ``cluster1`` and by :math:`-2` to obtain ``cluster2``. The resulting concatenated dataset ``data`` has shape
:math:`(200, 2)` and is distributed among the ``p`` processes along axis 0 (the sample axis).

.. code:: python

    num_ele = 100
    R = 1.0

    # Create a circular point cloud
    # Sample radius between 0 and R, and phi between 0 and 2*pi
    r = ht.random.rand(num_ele, split=0) * R
    phi = ht.random.rand(num_ele, split=0) * 2 * ht.constants.PI

    # Transform polar coordinates to cartesian coordinates
    x = r * ht.cos(phi)
    y = r * ht.sin(phi)


    # Stack the sampled points and shift them to locations (2,2) and (-2, -2)
    cluster1 = ht.stack((x + 2, y + 2), axis=1)
    cluster2 = ht.stack((x - 2, y - 2), axis=1)

    data = ht.concatenate((cluster1, cluster2), axis=0)
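
The sampling scheme described above can also be sketched without Heat, using only the Python standard library.
This is a minimal, non-distributed illustration and not part of Heat's API; the helper ``sample_disk`` and its
``seed`` parameter are hypothetical names introduced here. Unlike the Heat snippet, which shifts one sample twice,
this sketch draws an independent sample for each cluster.

.. code:: python

    import math
    import random


    def sample_disk(num_points, radius, center, seed=0):
        """Sample points inside a circle by drawing polar coordinates (r, phi)."""
        rng = random.Random(seed)
        cx, cy = center
        points = []
        for _ in range(num_points):
            r = rng.random() * radius            # r in [0, radius)
            phi = rng.random() * 2.0 * math.pi   # phi in [0, 2*pi)
            # Polar -> cartesian, then shift to the cluster center
            points.append((cx + r * math.cos(phi), cy + r * math.sin(phi)))
        return points


    cluster1 = sample_disk(100, 1.0, (2.0, 2.0), seed=1)
    cluster2 = sample_disk(100, 1.0, (-2.0, -2.0), seed=2)
    data = cluster1 + cluster2  # 200 points, each an (x, y) tuple

Drawing ``r`` uniformly places points more densely near the center of the circle than near its edge, which is
sufficient for this demonstration.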

Let's plot the data for illustration. In order to do so with matplotlib, we need to unsplit the data (gather it from
all processes) and convert it into a NumPy array. Plotting is done on rank 0 only.

.. code:: python

    data_np = ht.resplit(data, axis=None).numpy()
    if ht.MPI_WORLD.rank == 0:
        plt.plot(data_np[:,0], data_np[:,1], 'bo')
        plt.show()

This will render something like:

.. image:: ../_static/images/data.png

Now we perform the clustering analysis with k-means. We choose ``"kmeans++"`` as an intelligent way of sampling the
initial centroids.

.. code:: python

    kmeans = ht.cluster.KMeans(n_clusters=2, init="kmeans++")
    labels = kmeans.fit_predict(data).squeeze()
    centroids = kmeans.cluster_centers_

    # Select points assigned to clusters c1 and c2
    c1 = data[ht.where(labels == 0), :]
    c2 = data[ht.where(labels == 1), :]
    # After slicing, the arrays are not distributed equally among the processes anymore; we need to balance
    c1.balance_()
    c2.balance_()

    print(f"Number of points assigned to c1: {c1.shape[0]} \n"
          f"Number of points assigned to c2: {c2.shape[0]} \n"
          f"Centroids = {centroids}")

.. code:: text

    Number of points assigned to c1: 100
    Number of points assigned to c2: 100
    Centroids =  DNDarray([[ 2.0169,  2.0713],
                           [-1.9831, -1.9287]], dtype=ht.float32, device=cpu:0, split=None)
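
To illustrate the idea behind the ``"kmeans++"`` initialization used above, here is a minimal pure-Python sketch of
the commonly described seeding scheme: the first centroid is picked uniformly at random, and each further centroid
is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The
function ``kmeans_pp_init`` is a hypothetical helper for illustration; it is an assumption that it mirrors the
general technique and it does not reproduce Heat's internal implementation.

.. code:: python

    import random


    def squared_distance(p, q):
        """Squared Euclidean distance between two points."""
        return sum((a - b) ** 2 for a, b in zip(p, q))


    def kmeans_pp_init(points, n_clusters, seed=0):
        """Pick initial centroids following the k-means++ seeding idea."""
        rng = random.Random(seed)
        centroids = [rng.choice(points)]  # first centroid: uniform at random
        while len(centroids) < n_clusters:
            # Squared distance from each point to its nearest chosen centroid
            d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
            # Draw the next centroid with probability proportional to d2
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
        return centroids


    points = [(2.0, 2.0), (2.1, 1.9), (1.9, 2.1),
              (-2.0, -2.0), (-2.1, -1.9), (-1.9, -2.1)]
    seeds = kmeans_pp_init(points, n_clusters=2, seed=0)

Because points far from the existing centroids get larger weights, the seeds tend to be spread across different
clusters, which usually helps the subsequent iterations converge.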

Let's plot the assigned clusters and the respective centroids:

.. code:: python

    c1_np = c1.numpy()
    c2_np = c2.numpy()

    if ht.MPI_WORLD.rank == 0:
        plt.plot(c1_np[:, 0], c1_np[:, 1], 'x', color='#f0781e')
        plt.plot(c2_np[:, 0], c2_np[:, 1], 'x', color='#5a696e')
        plt.plot(centroids[0, 0], centroids[0, 1], '^', markersize=10, markeredgecolor='black', color='#f0781e')
        plt.plot(centroids[1, 0], centroids[1, 1], '^', markersize=10, markeredgecolor='black', color='#5a696e')
        plt.show()

.. image:: ../_static/images/clustering.png

We can also cluster the data with k-medians. The corresponding advanced initialization of the centroids is called ``"kmedians++"``.

.. code:: python

    kmedians = ht.cluster.KMedians(n_clusters=2, init="kmedians++")
    labels = kmedians.fit_predict(data).squeeze()
    centroids = kmedians.cluster_centers_

    # Select points assigned to clusters c1 and c2
    c1 = data[ht.where(labels == 0), :]
    c2 = data[ht.where(labels == 1), :]
    # After slicing, the arrays are not distributed equally among the processes anymore; we need to balance
    c1.balance_()
    c2.balance_()

    print(f"Number of points assigned to c1: {c1.shape[0]} \n"
          f"Number of points assigned to c2: {c2.shape[0]} \n"
          f"Centroids = {centroids}")
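
The difference between the two algorithms lies mainly in the centroid update. In the standard formulations,
k-means moves a centroid to the coordinate-wise mean of its assigned points, whereas k-medians uses the
coordinate-wise median, which is less sensitive to outliers. The following standard-library sketch illustrates
this on a small made-up cluster; the helper names ``centroid_mean`` and ``centroid_median`` are hypothetical and
not part of Heat.

.. code:: python

    from statistics import mean, median


    def centroid_mean(points):
        """Coordinate-wise mean of a list of points (k-means style update)."""
        return tuple(mean(coord) for coord in zip(*points))


    def centroid_median(points):
        """Coordinate-wise median of a list of points (k-medians style update)."""
        return tuple(median(coord) for coord in zip(*points))


    # Four nearby points and one outlier at x = 42
    cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (2.0, 1.0), (42.0, 3.0)]

    mean_center = centroid_mean(cluster)      # (10.0, 2.0): pulled toward the outlier
    median_center = centroid_median(cluster)  # (2.0, 2.0): stays with the bulk of the points

For well-separated, outlier-free data such as the two circular clusters in this tutorial, both updates are
expected to yield very similar centroids.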

Plotting the assigned clusters and the respective centroids:

.. code:: python

    c1_np = c1.numpy()
    c2_np = c2.numpy()
    if ht.MPI_WORLD.rank == 0:
        plt.plot(c1_np[:, 0], c1_np[:, 1], 'x', color='#f0781e')
        plt.plot(c2_np[:, 0], c2_np[:, 1], 'x', color='#5a696e')
        plt.plot(centroids[0, 0], centroids[0, 1], '^', markersize=10, markeredgecolor='black', color='#f0781e')
        plt.plot(centroids[1, 0], centroids[1, 1], '^', markersize=10, markeredgecolor='black', color='#5a696e')
        plt.show()

.. image:: ../_static/images/clustering_kmeans.png

The Iris Dataset
----------------
The *iris* dataset is a well-known example for clustering analysis. It contains four measured features for samples from
three different types of iris flowers. A subset of 150 samples is included in Heat in HDF5, CSV, and NetCDF formats,
located under ``heat/heat/datasets``, and can be loaded in a distributed manner with Heat's parallel
dataloader:

.. code:: python

    iris = ht.load("heat/datasets/iris.csv", sep=";", split=0)


Fitting the dataset with k-means:

.. code:: python

    k = 3
    kmeans = ht.cluster.KMeans(n_clusters=k, init="kmeans++")
    kmeans.fit(iris)

Let's see what the results are. In theory, there are 50 samples of each of the three iris types:

.. code:: python

    labels = kmeans.predict(iris).squeeze()

    # Select points assigned to clusters c1, c2 and c3
    c1 = iris[ht.where(labels == 0), :]
    c2 = iris[ht.where(labels == 1), :]
    c3 = iris[ht.where(labels == 2), :]
    # After slicing, the arrays are not distributed equally among the processes anymore; we need to balance
    c1.balance_()
    c2.balance_()
    c3.balance_()

    print(f"Number of points assigned to c1: {c1.shape[0]} \n"
          f"Number of points assigned to c2: {c2.shape[0]} \n"
          f"Number of points assigned to c3: {c3.shape[0]}")