KMeans clustering - Finding your centre
KMeans clustering is a method to partition your observations into k clusters, with k centroids that describe the centre of each cluster. A new observation belongs to the cluster whose centroid it is closest to. The diagram above illustrates the k-means clustering concept.
The KMeans approach starts by deciding the number of clusters you want. You then estimate where the centroid of each cluster might be located. The distance of each observation to each centroid is calculated, and each observation is assigned to the closest centroid. For each new cluster, we re-calculate the centroid by averaging the cluster data feature by feature. We repeat this cycle until no further refinement is achieved.
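As an illustrative sketch (made-up numbers, not from the Iris data): take the one-dimensional observations 1, 2, 9 and 10 with k = 2 and initial centroid guesses of 1 and 2. The first pass assigns 1 to the first centroid and 2, 9, 10 to the second, giving new centroids of 1 and 7. The next pass assigns 1 and 2 to the first centroid and 9 and 10 to the second, giving centroids of 1.5 and 9.5. A further pass changes nothing, so the cycle stops.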
Since Excel LAMBDA does not have iterative loops, a recursive approach will be used.
Implementing KMeans clustering in LAMBDA
Predict
Predict takes an array of observations and a list of centroids. For each observation, its Pythagorean distance to each centroid is calculated using SUMSQ, and the observation is clustered with the closest centroid. The implementation is shown below.
dcrML.KMeans.Predict
=LAMBDA(array, centroids, [showDetails],
    LET(
        distFromCentroids, MAKEARRAY(ROWS(array), ROWS(centroids),
            LAMBDA(nRow, nCol, SUMSQ(CHOOSEROWS(array,nRow)-CHOOSEROWS(centroids,nCol)))),
        nearestCentroids, BYROW(distFromCentroids, LAMBDA(pRow, MATCH(SMALL(pRow, 1), pRow, 0))),
        details, IF(showDetails,
            HSTACK(
                VSTACK("Distances", distFromCentroids),
                VSTACK("Nearest", nearestCentroids)
            ),
            nearestCentroids
        ),
        IFNA(details,"")
    )
)
NOTE: The MAKEARRAY calculates the distance (SUMSQ) for each observation (row: nRow) to each centroid (column: nCol).
NOTE: Technically we only calculated the square of the distances. Since we are not interested in the actual distance, just the comparative magnitudes, taking the square root is an unnecessary step.
NOTE: The Pythagorean distance was chosen here. However, it can be quite computationally intensive if you have a lot of data. Alternatives are the Hamming or Manhattan distance.
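As a usage sketch (the cell references here are assumptions, not from the original workbook): with the observations spilling from A3# and a set of candidate centroids spilling from H3#, a call like the following would return one nearest-centroid index per observation:

=dcrML.KMeans.Predict(A3#, H3#)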
Fit
Fit is recursive, refining its calculations on the road to finding the centroids. Apart from using Predict and dcrML.Help.GetHeaders, it is pretty much self-contained. The implementation is shown below.
dcrML.KMeans.Fit
=LAMBDA(array, numOfCentroids, [headers], [centroids], [prevNearestCentroids], [showDetails],
    LET(
        headers, dcrML.Help.GetHeaders(array, headers),
        centroids, IF(ISOMITTED(centroids),
            CHOOSEROWS(array, RANDARRAY(numOfCentroids,,1, ROWS(array),TRUE)),
            centroids),
        nearestCentroids, dcrML.KMeans.Predict(array, centroids, FALSE),
        newCentroids, MAKEARRAY(numOfCentroids, COLUMNS(array),
            LAMBDA(nRow, nCol,
                AVERAGE(CHOOSECOLS(FILTER(array, nearestCentroids=nRow), nCol))
            )
        ),
        nDiffCentroids, IF(ISOMITTED(prevNearestCentroids), 1000,
            SUM(IF(prevNearestCentroids = nearestCentroids, 0, 1))),
        details, IF(nDiffCentroids <> 0,
            IF(showDetails,
                HSTACK(
                    VSTACK("Nearest", nearestCentroids),
                    VSTACK("# Diff", nDiffCentroids),
                    VSTACK(headers, newCentroids),
                    dcrML.KMeans.Fit(array, numOfCentroids, headers, newCentroids, nearestCentroids, showDetails)
                ),
                dcrML.KMeans.Fit(array, numOfCentroids, headers, newCentroids, nearestCentroids, showDetails)
            ),
            IF(showDetails,
                HSTACK(
                    VSTACK("Nearest", nearestCentroids),
                    VSTACK("# Diff", nDiffCentroids),
                    VSTACK(headers, newCentroids)
                ),
                VSTACK(headers, newCentroids)
            )
        ),
        IFNA(details,"")
    )
)
Let's see it in action
For demonstration, the commonly available Iris dataset will be used.
The figure below illustrates how dcrML.KMeans.Fit takes its parameters. We supply the observation data (A3#), the expected number of centroids (3) and the observation headers (A2#). We also ask to see the details (TRUE).
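In formula terms, that call would look roughly like this (A3# and A2# are the spilled ranges described above; the two blank arguments are the optional centroids and prevNearestCentroids, which are left for the function to work out):

=dcrML.KMeans.Fit(A3#, 3, A2#, , , TRUE)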
Scrolling to the right, we see the derived centroids. Usually the details are not required, in which case you may omit showDetails = TRUE. The function would then return just the centroids.
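Without the details, the call sketched above reduces to something like the following, which returns just the headers row stacked on top of the final centroids:

=dcrML.KMeans.Fit(A3#, 3, A2#)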
Zooming out, we see the function was called 7 times before the stopping condition was reached.
The number of recursive calls is not easily repeatable because the starting centroids are picked with RANDARRAY. But the function can arrive at the same end result, especially when the natural clusters are nicely spaced out.
Visualisation
To help us visualise the clustering, we will limit ourselves to only 2 features: Sepal Length and Petal Length. These were chosen specifically because they have a very strong correlation (see this article).
The chart below is a scatter plot of Sepal Length vs Petal Length for the different natural clusters: Iris-setosa, Iris-versicolor, and Iris-virginica. Roughly in the middle of these plots are the centroids calculated using the k-means clustering approach. We can visually estimate that they are the centroids of these natural clusters.
However, repeating the function call a few times occasionally returns unexpected results like the one below. Here you can see 2 of the calculated centroids fall within the same natural cluster, while only 1 is left to describe the other 2 clusters.
Questions and Answers
Why did the KMeans function return an unexpected result of 2 centroids within a single cluster?
The initial centroid guesses are picked at random (via RANDARRAY), so two of the guesses can land near the same natural cluster. The refinement then settles into that arrangement rather than the one we expect.
Is there a way to improve this?
It depends on whether you know roughly where the centroids are located. If you can approximate the "actual" centroids, then passing these into the function would increase your chances of reaching the "actual" ones. I say "actual" because finding centroids is merely creating a model of what we want to see.
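For example (E3# is an assumption, standing in for wherever your approximate centroids spill from), a seeded call might look like:

=dcrML.KMeans.Fit(A3#, 3, A2#, E3#)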
If the number of clusters is unknown at the start, then the selection of k is guesswork.
If there is only a single cluster, then requesting 3 centroids merely partitions that cluster further into 3.
Also, if the clusters are not spaced out (due to insufficient distinguishing features), then it is equally valid to say there is only a single cluster.
How would you then know you have clustered correctly?
This is a good question. Repeating the calculation may move you closer to the correct centroids of the natural clusters. Also, varying k, the number of expected centroids, would help predict the number of clusters you should be seeing.
If the natural clusters are "nicely" separated, repeating the calculations will return the same set of centroids.
What if you requested more centroids than the original number of groupings?
If you requested 3 centroids when there is only a single cluster, the method attempts to further partition the cluster into 3 sub-clusters. This is why KMeans is considered unsupervised, i.e. it doesn't really know. It just does its best with what it is told.
Conclusion
It is not all doom and gloom. You have to test it out a few rounds. You may need to plot some visuals to gain confidence. But all in all, the KMeans clustering concept is sound and remains a good unsupervised clustering approach.
NOTE: In my earlier attempts at implementing data mining techniques as LAMBDA functions, I created a number of custom helper functions. This was inelegant. Revisiting them, I tidied them up so that they are more self-contained. I still use the supporting functions, but only because their steps are used over and over in different examples.
Stay tuned for more data mining techniques from DC-DEN.