Posts

Showing posts with the label Supervised Learning

It's Probably Correct - Classifying with Naïve Bayes

Naïve Bayes is a categorical, probabilistic, supervised classification algorithm. You may already be familiar with the term supervised classification, so I will not repeat it here. Naïve Bayes doesn't require numerical values; it relies on categories or labels only. It is probabilistic because it uses probabilities calculated from the given records to make the classification. The more consistent the data (repeatable patterns), the stronger the probability of the classification. Note: The implementation and example here follow closely from Learn Data Mining Through Excel by Hong Zhou. Great book!

Bayes Theorem

Naïve Bayes is based on Bayes theorem, most famously written as:

`P(y|x) = (P(x|y) * P(y)) / P(x)`

where:

- `P(y|x)` is the probability of `y` given `x`
- `P(x|y)` is the probability of `x` given `y`
- `P(y)` and `P(x)` are the probabilities of `y` and `x` respectively.

With multiple independent variables `(x_1, x_2, ..., x_n)`, the equation becomes:

`P(y|x_1, ..., x_n) = (P(x_1|y) * P(x_2|y) * ... * P(x_n|y) * P(y)) / P(x_1, ..., x_n)`
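To make the calculation concrete, here is a minimal Python sketch of a categorical Naïve Bayes fit and predict (the post itself builds this in Excel; the data, function names, and omission of Laplace smoothing here are illustrative choices, not the post's implementation):

```python
from collections import Counter, defaultdict

def fit(rows, labels):
    """Count class priors P(y) and per-feature value counts for P(x_i|y)."""
    priors = Counter(labels)
    likelihoods = defaultdict(Counter)  # (feature_index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            likelihoods[(i, y)][value] += 1
    return priors, likelihoods

def predict(row, priors, likelihoods):
    """Pick the class with the highest P(y) * prod P(x_i|y)."""
    total = sum(priors.values())
    best_class, best_score = None, -1.0
    for y, count in priors.items():
        score = count / total  # prior P(y)
        for i, value in enumerate(row):
            # relative frequency of this value within class y, i.e. P(x_i|y)
            score *= likelihoods[(i, y)][value] / count
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy categorical data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
priors, likelihoods = fit(rows, labels)
print(predict(("rainy", "mild"), priors, likelihoods))  # -> "yes"
```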

Who are your neighbours? Classification with KNN

The closer an observation is to a class, the more likely it belongs to that class. KNN, or K-Nearest Neighbours, is a non-parametric supervised classification algorithm. It uses proximity to decide which group an observation point belongs to. How do you measure proximity? You may remember Pythagoras' theorem from your school days: the distance between two points is the square root of the sum of the squares of the sides. That would be one way to define proximity, but we could define it differently. Given two points `(x_1, x_2, x_3)` and `(y_1, y_2, y_3)`:

Euclidean distance: `sqrt( (y_1-x_1)^2 + (y_2-x_2)^2 + (y_3-x_3)^2 )`
Manhattan distance: `|y_1-x_1| + |y_2-x_2| + |y_3-x_3|`
Chebyshev distance: `max(|y_1-x_1|, |y_2-x_2|, |y_3-x_3|)`

Each of these definitions has its pros and cons. For our implementation, I will be using Euclidean distance. However, if you wish to reduce the computational complexity, you might want to try Manhattan or Chebyshev distance.
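As a concrete illustration, here is a minimal Python sketch of KNN with Euclidean distance (the post implements this in Excel; the points, labels, and function name below are made up for the example):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(p, query), label)  # math.dist = Euclidean distance (Python 3.8+)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1)]
labels = ["A", "A", "B", "B", "A"]
print(knn_predict(points, labels, (2, 2), k=3))  # -> "A"
```

Swapping in Manhattan or Chebyshev distance only requires replacing the `math.dist` call with the corresponding formula above.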

Linear Discriminant Analysis (LDA) - Using Linear Regression for Classification

Linear Discriminant Analysis (LDA) uses linear regression to classify data. Essentially, you assign each class a numerical value, then use the linear regression method to project your observations onto those assigned values, and finally calculate the thresholds that distinguish between the classes. In effect, LDA attempts to find the best linear function that separates your data points into distinct classes. The above diagram illustrates this idea.

Implementing LDA using LAMBDA

Steps in implementing LDA's Fit (a sketch follows the list):

1. Find the distinct classes and assign each an arbitrary value - UNIQUE and SEQUENCE.
2. Designate each observation with the assigned value of its class - XLOOKUP.
3. Find the linear regression coefficients for these observations - dcrML.Linear.Fit.
4. Project each observation onto the linear regression - dcrML.Linear.Predict.
5. Find the threshold of each class - classCutOff - from the spread of each class's regression projections.
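Here is a rough Python sketch of the same recipe (the post uses Excel LAMBDAs such as dcrML.Linear.Fit; numpy's least-squares stands in for that step here, and the cut-off between adjacent classes is simplified to the midpoint of their mean projections rather than the post's spread-based classCutOff):

```python
import numpy as np

def lda_fit(X, labels):
    """Fit the regression-based LDA recipe: assign values, regress, find cut-offs."""
    classes = sorted(set(labels))                         # step 1: distinct classes
    value_of = {c: i + 1 for i, c in enumerate(classes)}  # arbitrary values 1, 2, ...
    y = np.array([value_of[l] for l in labels], float)    # step 2: value per observation
    A = np.column_stack([np.asarray(X, float), np.ones(len(X))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)        # step 3: regression coefficients
    proj = A @ coeffs                                     # step 4: project observations
    means = [proj[np.array(labels) == c].mean() for c in classes]
    cutoffs = [(a + b) / 2 for a, b in zip(means, means[1:])]  # step 5: thresholds
    return coeffs, classes, cutoffs

def lda_predict(x, coeffs, classes, cutoffs):
    """Project a new observation and count how many cut-offs it exceeds."""
    p = float(np.dot(np.append(x, 1.0), coeffs))
    return classes[sum(p > c for c in cutoffs)]

# Toy example: two clouds in 2-D
X = [(1, 2), (2, 1), (2, 2), (8, 9), (9, 8), (9, 9)]
labels = ["A", "A", "A", "B", "B", "B"]
coeffs, classes, cutoffs = lda_fit(X, labels)
print(lda_predict((8, 8), coeffs, classes, cutoffs))  # -> "B"
```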

Excuse me. Some Terminologies: Classification vs Clustering vs Regression

This is a short post describing some terms used in data mining. Classification arranges data into classes/categories using a labeled dataset. Clustering separates an unlabeled dataset into clusters/groups of similar objects. Regression develops a model to predict continuous numerical values. Classification is a supervised learning algorithm, while clustering is unsupervised. Regression is also considered supervised learning because the model is trained on both the input features and the output labels, which can be numerical values. Supervised means we rely on labeled training data; unsupervised means unlabeled training data. That's all for now from DC-DEN!

Linear Regression: Why you should reinvent Excel's LINEST?

In the previous article on Linear Regression, I mentioned Excel's LINEST function. But if you have tried using the returned coefficients, you may have noticed something peculiar: the linear coefficients are returned in the reverse order of the input data. The LINEST documentation states: The equation for the line is `y = m_1*x_1 + m_2*x_2 + ... + m_n*x_n + b` if there are multiple ranges of x-values, where the dependent y-values are a function of the independent x-values. The m-values are coefficients corresponding to each x-value, and b is a constant value. Note that y, x, and m can be vectors. The array that the LINEST function returns is `{m_n, m_(n-1), ..., m_1, b}`. So the inputs are in the order 1st, 2nd, 3rd, ..., but the returned coefficients are reversed. If you were to use the coefficients to predict y for a given `x_1, x_2, x_3, ...`, you would have to swap either the x-s or the coefficients around. This isn't intuitive. For this reason you should reinvent LINEST.
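To see the pitfall concretely, here is a tiny Python sketch that takes a LINEST-style coefficient array `{m_n, ..., m_1, b}` and reverses it back into input order before predicting (the coefficient values are made up for illustration):

```python
# LINEST-style output: coefficients in reverse input order, constant last.
linest_output = [3.0, 2.0, 1.0, 0.5]   # {m_3, m_2, m_1, b} for inputs x_1, x_2, x_3

def predict(xs, linest_output):
    """Predict y = m_1*x_1 + ... + m_n*x_n + b from a reversed coefficient array."""
    *ms_reversed, b = linest_output
    ms = ms_reversed[::-1]              # restore m_1, m_2, ..., m_n input order
    return sum(m * x for m, x in zip(ms, xs)) + b

print(predict([1.0, 2.0, 3.0], linest_output))  # 1*1 + 2*2 + 3*3 + 0.5 = 14.5
```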