Who are your neighbours? Classification with KNN
The closer an observation is to a class, the more likely it is to belong to that class.
KNN, or K-Nearest Neighbours, is a non-parametric supervised learning classification algorithm. It uses proximity to decide which class a new observation point belongs to.
How do you measure proximity?
You may remember Pythagoras' theorem from your school days: the distance between two points is the square root of the sum of the squares of the differences along each axis. That is one way to define proximity, but it is not the only one. Given two points (`x_1`, `x_2`, `x_3`) and (`y_1`, `y_2`, `y_3`):
Euclidean distance: `sqrt( (y_1-x_1)^2 + (y_2-x_2)^2 + (y_3-x_3)^2 )`
Manhattan distance: `|y_1-x_1| + |y_2-x_2| + |y_3-x_3|`
Chebyshev distance: `max( |y_1-x_1|, |y_2-x_2|, |y_3-x_3| )`
Each of these definitions has its pros and cons. For our implementation, I will be using Euclidean distance. However, if you wish to reduce the computational cost, you might want to try Manhattan or Chebyshev.
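If you want to experiment with the different measures, each of them is a one-liner as a Lambda. Below is a minimal sketch, assuming the two points are passed in as equal-sized vectors; the names are purely illustrative and not part of dcrML.
Dist.Euclidean =LAMBDA(vecX, vecY, SQRT(SUMSQ(vecY - vecX)))
Dist.Manhattan =LAMBDA(vecX, vecY, SUM(ABS(vecY - vecX)))
Dist.Chebyshev =LAMBDA(vecX, vecY, MAX(ABS(vecY - vecX)))
For example, =Dist.Manhattan({1,2,3}, {4,0,3}) returns 5, while =Dist.Chebyshev({1,2,3}, {4,0,3}) returns 3.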
Implementing KNN in Excel Lambda
KNN is not difficult. In a nutshell, it is about calculating the distances between a new observation and the training observations, finding the k nearest neighbours, and choosing the majority class among them as the label of the new observation.
Fit
The function name of this step is a bit misleading. If we were only interested in calculating the distances between a new observation and the training data, we could do that directly. The name is kept deliberately, to be consistent with the Fit-Predict pair I have been using.
In Fit, therefore, what we want is to pre-calculate the normalised value of each feature, so that all features are on the same scale of 0 to 100. Normalising ensures each feature is treated the same; otherwise, in Predict, features with values in a higher range would overwhelm features with values in a lower range. To normalise each feature we use its range, i.e. the maximum minus the minimum.
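For example, if a feature ranges from a minimum of 2 to a maximum of 12, the factor is 100/(12 - 2) = 10, so a raw value of 7 is normalised to (7 - 2) * 10 = 50, i.e. halfway along the 0 to 100 scale.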
dcrML.KNN.Fit =LAMBDA(arrayData, arrayClass, [dataHeaders], [showDetails],
    LET(
        dataHeaders, dcrML.Help.GetHeaders(arrayData, dataHeaders),
        colMax, BYCOL(arrayData, LAMBDA(pCol, MAX(pCol))),
        colMin, BYCOL(arrayData, LAMBDA(pCol, MIN(pCol))),
        colFactor, 100/(colMax - colMin),
        normArray, colFactor * (arrayData - colMin),
        details, HSTACK(
            "Normalise Vector",
            TRANSPOSE(VSTACK(
                HSTACK("Max", colMax),
                HSTACK("Min", colMin),
                HSTACK("Factor", colFactor)
            )),
            "Normalised Array",
            VSTACK(dataHeaders, normArray),
            VSTACK("Class", arrayClass)
        ),
        IFNA(details, "")
    )
)
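As a rough usage sketch, with purely hypothetical cell ranges: if the training features sit in A2:D121 and the class labels in E2:E121, then =dcrML.KNN.Fit(A2:D121, E2:E121) spills the Max/Min/Factor vector, the normalised training array and the classes side by side, ready to be fed into Predict.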
Predict
In this step, we calculate the distance from a given observation to each training observation. We then take the k nearest neighbours, and the majority class among these neighbours is used as the label for the given observation. To avoid ties, k is usually an odd number.
dcrML.KNN.Predict =LAMBDA(normVec, normArray, normClass, arrayData, k, [dataHeaders], [showDetails],
    LET(
        dataHeaders, dcrML.Help.GetHeaders(arrayData, dataHeaders),
        colMin, TOROW(CHOOSECOLS(normVec, 2)),
        colFactor, TOROW(CHOOSECOLS(normVec, 3)),
        normPredict, colFactor * (arrayData - colMin),
        normClass, BYROW(normPredict, LAMBDA(pPredict,
            LET(
                distances, BYROW(normArray - pPredict, LAMBDA(pRow, SQRT(SUMSQ(pRow)))),
                reciprocal, 1/(1+distances),
                weight, reciprocal/SUM(reciprocal),
                kDistance, SMALL(distances, k),
                classes, TOROW(UNIQUE(normClass)),
                kClasses, BYCOL(classes, LAMBDA(pCol,
                    SUMPRODUCT(weight, --(normClass=pCol), --(distances <= kDistance))
                )),
                INDEX(classes, 1, MATCH(MAX(kClasses), kClasses, 0))
            )
        )),
        details, IF(showDetails,
            HSTACK(
                "Normalised", VSTACK(dataHeaders, normPredict),
                VSTACK("Predict", normClass)
            ),
            normClass
        ),
        IFNA(details, "")
    )
)
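Again as a rough sketch with hypothetical ranges: if Fit spilled the Max/Min/Factor block into H2:J5, the normalised training array into L2:O121 and the training classes into P2:P121, then classifying new observations held in A2:D31 with k = 5 would look something like =dcrML.KNN.Predict(H2:J5, L2:O121, P2:P121, A2:D31, 5).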
Why is there a weight variable here? We want to give preference to training observations that are closer over those that are farther away. We do this by weighting each neighbour's vote by the reciprocal of one plus its distance, so closer neighbours contribute more to the class tally.
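For example, if the three nearest neighbours are at distances of 1, 3 and 4, their raw weights (before normalisation) are 1/(1+1) = 0.5, 1/(1+3) = 0.25 and 1/(1+4) = 0.2, so the vote of the closest neighbour counts more than twice as much as that of the farthest.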
Seeing KNN in Action
We will use the Iris Training and Validation data for this KNN demonstration.
Fit
Below is a screenshot of the Fit function. The training data is on the left. Fit calculates the Max, Min and Factor values, and the normalised training data is shown on the right.
Predict
With the normalised data, we can now classify the validation data with KNN and also measure its accuracy. Below is a screenshot of the Predict function in action. The validation data is on the left, the predictions are in the middle, and on the right the predictions are compared with the validation labels to measure accuracy.
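Measuring the accuracy itself is a one-liner. Assuming, purely for illustration, that the predictions spill into F2:F31 and the true validation classes sit in E2:E31, something like =SUM(--(F2:F31=E2:E31))/ROWS(E2:E31) gives the proportion of correct classifications.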
Questions and Answers
What is the downside of KNN?
KNN can be computationally intensive. To classify a new observation point, you have to calculate its distance to every training observation. If you had 1 million training observations, that would be 1 million distances to calculate. And in the implementation above, each of those is a full Euclidean distance, i.e.
distances, BYROW(normArray - pPredict, LAMBDA(pRow, SQRT(SUMSQ(pRow)))),
Would changing the distance calculation method help?
Yes, switching the distance calculation to Manhattan or Chebyshev would make each distance cheaper to compute, since there is no squaring or square root. But the overall amount of work remains, because the number of distances to calculate remains the same.
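As a sketch, switching the implementation to Manhattan distance is just a one-line change inside Predict:
distances, BYROW(normArray - pPredict, LAMBDA(pRow, SUM(ABS(pRow)))),
The rest of the function stays the same.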
So, why is KNN still being used?
It is very accurate. It can handle numerical and categorical features - the latter have to be binned or mapped to a numeric value. And unlike Linear Discriminant Analysis (LDA), which imposes strict linear decision boundaries, KNN handles arbitrarily shaped class boundaries better.
So if you need speed or you lack computational resources, use LDA. But if you want accuracy, KNN may be the better choice.
Conclusion
As you can see, KNN is not difficult. The tricky bits are normalising the training data and calculating the weights. You should also take heed of the pointers addressed in the Q&A above.
We are coming to the end of the Data Mining techniques. There are other topics like logistic regression, neural networks, etc. I am still trying to figure out how to implement numerical methods when there is no analytical solution. In Excel you have Goal Seek and Solver to do this, but it is a lot more tricky with Lambda.
The next post will be on Naive Bayes classification. In the meantime, go forth and KNN classify with DC-DEN!
Side comment: after a few data mining implementations, I find Lambda sorely lacking an inline code comment feature. This makes documenting the formulas very difficult.