It's Probably Correct - Classifying with Naïve Bayes

Naïve Bayes is a categorical, probabilistic, supervised classification algorithm. You may already be familiar with the term supervised classification, so I will not repeat it here. Naïve Bayes doesn't require numerical values; it works with categories or labels only. It is probabilistic because it uses probabilities calculated from the given records to decide the classification. The more consistent the data (repeatable patterns), the stronger the probability of the classification.

Note: The implementation and example here follow closely Learn Data Mining Through Excel by Hong Zhou. Great book!

Bayes Theorem

Naïve Bayes is based on Bayes theorem, most famously written as:

`P(y|x) = (P(x|y)*P(y)) / (P(x))`

where:

`P(y|x)` is the probability of `y` given `x`

`P(x|y)` is the probability of `x` given `y`

`P(y)` and `P(x)` are the probabilities of y and x respectively.

With multiple independent variables `(x_1, x_2, ..., x_n)`, the equation can be rewritten as:

`P(y_k|x) = (P(x_1|y_k)*P(x_2|y_k)*...*P(x_n|y_k)*P(y_k)) / P(x)`

where:

`y_k` is the k-th classification.

Don't get bogged down trying to understand the equations. What is important is how to calculate these probabilities.

What is Naïve Bayes?

Naïve Bayes classification uses Bayes theorem to calculate the probabilities of different classifications based on a history of given (prior) records. When a new record is encountered, the likelihood of that record belonging to each class is calculated, and the class with the highest probability is chosen to classify the record.

It is called naïve because it assumes the features (variables) are independent of each other. This is a strong assumption, since in practice features often are not independent. For now we will assume independent variables.
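To make the calculation concrete, here is a small worked example with made-up counts (the numbers are purely hypothetical, not from the book's data set). Suppose 10 historical records have two features, Cough and Age, and a class Lung_Cancer that is "Yes" in 6 records and "No" in 4. Suppose Cough "High" appears in 5 of the 6 "Yes" records and 1 of the 4 "No" records, and Age "Old" appears in 4 of the 6 "Yes" records and 2 of the 4 "No" records. For a new record with Cough "High" and Age "Old":

`Score(Yes) = P(High|Yes)*P(Old|Yes)*P(Yes) = (5/6)*(4/6)*(6/10) ≈ 0.333`

`Score(No) = P(High|No)*P(Old|No)*P(No) = (1/4)*(2/4)*(4/10) = 0.050`

The record is classified as "Yes". Because `P(x)` is the same for every class, it can be ignored when only the ranking matters; dividing by the sum of the scores gives a normalised probability of roughly `0.333/(0.333+0.050) ≈ 0.87`.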

Implementing Naïve Bayes learning in Lambda

Let's implement this algorithm in Excel Lambda.

Fit

The Lambda implementation below may look cryptic, but the real work is in preparing the arrays (flattening into arrayFlatMatrix and classFlatMatrix) so that the probabilities can be calculated in arrayCatProb.

dcrML.NaiveBayes.Fit
=LAMBDA(arrayData, arrayClass, [dataHeaders], [showDetails],
  LET(
    arrayHeaders, dcrML.Help.GetHeaders(arrayData, dataHeaders),

    arrayFlat, TOCOL(arrayHeaders & ":" & arrayData, 3, TRUE),
    numFeatures, COLUMNS(arrayData),
    classFlat, TOCOL(CHOOSECOLS(arrayClass, 1, SEQUENCE(numFeatures-1,1,1,0)), 3, TRUE),

    numData, ROWS(arrayData),
    classCat, TOROW(SORT(UNIQUE(arrayClass))),
    arrayCat, UNIQUE(arrayFlat),

    classCatCount, COUNTIF(arrayClass, classCat),
    classCatProb, classCatCount / numData,

    arrayFlatMatrix, N(arrayFlat=TRANSPOSE(arrayCat)),
    classFlatMatrix, N(classFlat=classCat),
    arrayCatSize, COUNTA(arrayCat),
    classCatSize, COUNTA(classCat),

    arrayCatCount, MAKEARRAY(arrayCatSize, classCatSize, LAMBDA(r, c, SUM(CHOOSECOLS(arrayFlatMatrix,r) * CHOOSECOLS(classFlatMatrix,c)))),
    arrayCatProb, arrayCatCount / classCatCount,

    details, IF(showDetails,
      HSTACK(
        VSTACK("Features", TOCOL(arrayHeaders)),
        VSTACK("Class", TOCOL(classCat)),
        VSTACK("Class Probability", TOCOL(classCatProb)),
        VSTACK("Features by Class Probabilities", arrayCat),
        VSTACK(classCat, arrayCatProb)
      ),
      HSTACK(TOCOL(arrayHeaders), TOCOL(classCat), TOCOL(classCatProb), arrayCat, arrayCatProb)
    ),

    IFNA(details,"")
  )
)
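As a sketch of how Fit might be called (the cell ranges here are hypothetical, not from the book's worksheet), with the feature values in B2:D15, their headers in B1:D1 and the class column in E2:E15:

=dcrML.NaiveBayes.Fit(B2:D15, E2:E15, B1:D1)

Setting the optional showDetails argument to TRUE adds the labels and header row produced by the details branch above, which is handy for inspecting the fitted probabilities.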

Predict

Predict takes the output from Fit as arrayFit. For each record in arrayPredict it calculates the class scores in pdictProbArray, takes the highest one in predictRow, and looks up the corresponding classification in predictClass. Dividing by pdictSum plays the role of the common denominator `P(x)`. Note the small trick in predictRow: the index of the winning class and its probability are packed into a single number (pdictIdx + pdictProb) and separated again with FLOOR.MATH, so one BYROW pass can return both values.

dcrML.NaiveBayes.Predict
=LAMBDA(arrayFit, arrayPredict, [showDetails],
  LET(
    arrayHeaders, FILTER(CHOOSECOLS(arrayFit,1),CHOOSECOLS(arrayFit,1)<>""),
    classCat, FILTER(CHOOSECOLS(arrayFit,2),CHOOSECOLS(arrayFit,2)<>""),
    classCatProb, FILTER(CHOOSECOLS(arrayFit,3),CHOOSECOLS(arrayFit,3)<>""),
    arrayCat, FILTER(CHOOSECOLS(arrayFit,4),CHOOSECOLS(arrayFit,4)<>""),
    arrayCatProb, DROP(arrayFit,,4),
    predictCat, TOROW(arrayHeaders) & ":" & arrayPredict,

    predictRow, BYROW(predictCat, LAMBDA(pdictRow,
      LET(
        pdictFilter, arrayCat = pdictRow,
        pdictMatrix, N(BYROW(pdictFilter, LAMBDA(pRow, OR(pRow)))),
        pdictMatrixProb, arrayCatProb * pdictMatrix,
        pdictMatrixOned, IF(pdictMatrixProb = 0, 1, pdictMatrixProb),
        pdictProbArray, BYCOL(pdictMatrixOned, LAMBDA(pCol, PRODUCT(pCol))) * TOROW(classCatProb),
        pdictSum, SUM(pdictProbArray),
        pdictMax, MAX(pdictProbArray),
        pdictProb, pdictMax/pdictSum,
        pdictIdx, MATCH(pdictMax, pdictProbArray, 0),

        pdictIdx + pdictProb
      )
    )),

    predictIdx, FLOOR.MATH(predictRow),
    predictProb, predictRow - predictIdx,
    predictClass, INDEX(TOROW(classCat),1,predictIdx),

    details, IF(showDetails,
      HSTACK(
        VSTACK("Class", predictClass),
        VSTACK("Idx", predictIdx),
        VSTACK("Probability", predictProb)
      ),
      predictClass
    ),

    IFNA(details, "")
  )
)
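A matching usage sketch (again with hypothetical ranges): if the Fit output has been placed in G2:L10 and the records to classify are in B18:D20:

=dcrML.NaiveBayes.Predict(G2:L10, B18:D20, TRUE)

With showDetails TRUE each prediction comes back with its class, index and probability; without it, only the predicted class is returned.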

Seeing Naïve Bayes in Action

The example data comes from the Hong Zhou book. Suppose you have this set of historical records. It details fictitious lung tissue sample features and the related cancer diagnosis.

We can use the above Lambda Naïve Bayes implementation to calculate the Fit data.

The size of the Fit data depends on the number of features, distinct feature values, and classes in the input records: one column each for the feature headers, classes, class probabilities and feature:value labels, plus one probability column per class, with roughly one row per distinct feature:value pair.

Now, suppose you need to predict the class of a given sample record.

From a simple visual comparison with the historical records, we expect the prediction to be a Lung_Cancer "Yes". So we use the Predict function in the following manner: the input Fit data is on the left, the records to be predicted are in the middle, and the prediction table is on the right.

Here we see Naïve Bayes predicting a "Yes" as expected.

Questions and Answers

If we can predict manually, why do you need Naïve Bayes?

I would give two reasons. First, Naïve Bayes learns from historical records how we would classify manually, which automates an otherwise mundane task, especially if you have many or frequent records. Second, sometimes many more features weigh in on the classification. The earlier example had only 3 features, so classification is trivial. But if you had, say, 16 features, evaluation becomes a considerable effort, and training another person to perform it would take time. Naïve Bayes learns from your classification behaviour and applies it to new records.

Does this work with categories?

Yes, and that is the beauty of Naïve Bayes. In the real world we tend to categorise features with words - tall, short, heavy, light, big, medium, small, hard, soft, pink, blue, brown, etc. Naïve Bayes works well with these. Other methods would require translating the categorical data to some scaled numerical value.

But categories can be subjective?

That is true. But categories can be reliable if proper calibration is applied - for example, training and refreshers, and tool aids like gauges, templates, comparison charts, etc.

What happens if the variables are not independent?

The Naïve Bayes equation assumes variable independence. Under this assumption the equation shown earlier holds, and that keeps the calculation simple. If the assumption is not met, the equation does not strictly hold. In practice Naïve Bayes often still classifies reasonably well with correlated features, but the calculated probabilities become less trustworthy.

Conclusion

When I first heard of Bayes theorem and Naïve Bayes, they gave me anxiety. But the best way forward was to face my fears head on. Having worked through the implementation, I have gained the confidence to work on more probability-based methods. Understanding clears the fog.

Epilogue

This will be my last post on data mining for now. I am still exploring Excel Lambda; among the topics are numerical analysis and gradient descent. If I find a simple way forward, I will be back with posts on logistic regression, neural networks, etc.

Until then, go forth and make great your DC-DEN!
