Who are your neighbours? Classification with KNN

The closer an observation is to a class, the more likely it belongs to that class. KNN, or K-Nearest Neighbours, is a non-parametric supervised learning classification algorithm. It uses proximity to decide which group an observation point belongs to.

How do you measure proximity? You may remember Pythagoras' theorem from your school days: the length of the hypotenuse is the square root of the sum of the squares of the other two sides. That would be one way to define proximity. But we could define proximity differently. Given two points (`x_1`, `x_2`, `x_3`) and (`y_1`, `y_2`, `y_3`):

Euclidean distance: `sqrt( (y_1-x_1)^2 + (y_2-x_2)^2 + (y_3-x_3)^2 )`

Manhattan distance: `|y_1-x_1| + |y_2-x_2| + |y_3-x_3|`

Chebyshev distance: `max(|y_1-x_1|, |y_2-x_2|, |y_3-x_3|)`

Each of these definitions has its pros and cons. For our implementation, I will be using Euclidean distance. However, if you wish to reduce the computational complexity, you might want to try Manhattan or Chebyshev.
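The post's implementation is in Excel LAMBDA, but the three distance definitions and the neighbour vote are easy to sketch in plain Python for comparison. This is an illustrative sketch, not the blog's code, and the function names are my own:

```python
import math
from collections import Counter

def euclidean(p, q):
    # square root of the sum of squared differences
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # sum of absolute differences
    return sum(abs(qi - pi) for pi, qi in zip(p, q))

def chebyshev(p, q):
    # largest absolute difference across features
    return max(abs(qi - pi) for pi, qi in zip(p, q))

def knn_predict(train, labels, point, k=3, dist=euclidean):
    # sort training points by distance to the query point, keep the k nearest
    nearest = sorted(zip(train, labels), key=lambda tl: dist(tl[0], point))[:k]
    # majority vote among the k nearest neighbours
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Swapping `dist=manhattan` or `dist=chebyshev` into `knn_predict` changes only the proximity definition; the voting logic stays the same.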

Linear Discriminant Analysis (LDA) - Using linear regression for classification

Image
Linear Discriminant Analysis (LDA) uses linear regression to supervise the classification of data. Essentially, you assign each class a numerical value. Then you use a linear regression method to calculate the projection of your observations onto the assigned numerical values. Finally, you calculate the thresholds that distinguish between classes. In essence, LDA attempts to find the best linear function that separates your data points into distinct classes. The above diagram illustrates this idea.

Implementing LDA using LAMBDA

Fit

Steps in implementing LDA's Fit:

1. Find the distinct classes and assign each an arbitrary value - UNIQUE and SEQUENCE.
2. Designate each observation with the arbitrarily assigned value depending on its class - XLOOKUP.
3. Find the linear regression coefficients for these observations - dcrML.Linear.Fit.
4. Project each observation onto the linear regression - dcrML.Linear.Predict.
5. Find the threshold of each class - classCutOff - from the spread of each re
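The Fit steps above use UNIQUE, SEQUENCE, XLOOKUP and the dcrML.Linear helpers in Excel LAMBDA. As a rough Python sketch of the same five steps, with numpy least squares standing in for dcrML.Linear.Fit and midpoints between class-mean projections standing in for classCutOff (both my assumptions, not the blog's exact logic):

```python
import numpy as np

def lda_fit(X, classes):
    # 1. find the distinct classes and assign each an arbitrary value (0, 1, 2, ...)
    uniq = sorted(set(classes))
    value = {c: i for i, c in enumerate(uniq)}
    # 2. designate each observation with its class's assigned value
    y = np.array([value[c] for c in classes], dtype=float)
    # 3. least-squares linear regression coefficients (column of ones = intercept)
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    # 4. project each observation onto the fitted line
    proj = A @ coef
    # 5. thresholds: midpoints between mean projections of consecutive classes
    means = [proj[y == value[c]].mean() for c in uniq]
    cuts = [(a + b) / 2 for a, b in zip(means, means[1:])]
    return coef, cuts, uniq

def lda_predict(x, coef, cuts, uniq):
    # project the new observation, then count how many thresholds it exceeds
    p = np.dot(np.append(x, 1.0), coef)
    idx = sum(p > c for c in cuts)
    return uniq[idx]
```

The prediction side is just the projection followed by a threshold lookup, which mirrors the "project then cut off" idea in the post.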

KMeans clustering - Finding your centre

Image
KMeans clustering is a method to partition your observations into k clusters, with k centroids that describe the centre of each cluster. A new observation belongs to the cluster whose centroid it is closest to. The diagram above illustrates the k-means clustering concept.

The KMeans approach starts by deciding the number of clusters you want. Then you estimate where the centroid of each cluster might be located. The distance of each observation to each centroid is calculated, and each observation is re-clustered to its closest centroid. For each new cluster, we re-calculate a new centroid by averaging the cluster data by each feature. We repeat this cycle until no further refinement is achieved. Since Excel LAMBDA does not have iterative loops, a recursive approach will be used.

Implementing KMeans clustering in LAMBDA

With k-means clustering we implement Predict before Fit.

Predict

Predict takes a list of observations array and a list of centr
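The assign/average/repeat cycle described above can be sketched in Python, written recursively to mirror the LAMBDA approach since LAMBDA has no iterative loops. This is an illustrative sketch with my own function names, not the blog's implementation:

```python
import math

def closest(point, centroids):
    # index of the centroid nearest to the point (Euclidean distance)
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids):
    # assign each observation to its closest centroid
    assign = [closest(p, centroids) for p in points]
    # re-calculate each centroid by averaging its cluster, feature by feature
    new = []
    for i in range(len(centroids)):
        cluster = [p for p, a in zip(points, assign) if a == i]
        if cluster:
            new.append(tuple(sum(f) / len(cluster) for f in zip(*cluster)))
        else:
            new.append(centroids[i])  # keep an empty cluster's centroid as-is
    # recurse until no further refinement is achieved
    if new == list(centroids):
        return new, assign
    return kmeans(points, new)
```

The exit condition - centroids unchanged after a pass - is what keeps the recursion from looping forever, just as a recursive LAMBDA needs a termination test.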

Excuse me. Some Terminologies: Classification vs Clustering vs Regression

This is a short post to describe some terms used in data mining.

Classification arranges data into classes/categories using a labelled dataset. Clustering separates an unlabelled dataset into clusters/groups of similar objects. Regression develops a model to predict continuous numerical values.

Classification is a supervised learning algorithm, while clustering is an unsupervised algorithm. Regression is considered supervised learning because the model is trained using both the input features and the output labels - which can be numerical values. Supervised means we rely on labelled training data; unsupervised means unlabelled training data.

That's all for now from DC-DEN!

Happy New Year 2024!

Happy New Year🎈🎉! When the DC-DEN blog started in 2021, I really wanted to write about data centre stuff. But I realised that a number of the things I would show/share are either confidential or proprietary. So in 2023, I decided to extend the blog to cover Excel LAMBDA in the areas of statistics and data mining. I hope you are taking a good break and that you will come back next year for more from DC-DEN!

Recursion: LAMBDA calling LAMBDA

Excel LAMBDA functions can be used to create custom, reusable functions by giving them a friendly name. This means a custom LAMBDA function can call itself. A recursive function is a function that calls itself and moves towards an exit condition. Without the exit condition, the function loops continuously unless externally terminated.

Why use recursion? Some problems can be expressed in an elegant recursive structure. For example, the n-th factorial is defined as `n! = n (n-1) (n-2) ... 1` with the termination condition `0! = 1`.

Recursion example using LAMBDA

Before we start: Excel provides the FACT function to calculate the factorial of a number. You can use it to check your own results. This is an example of a recursive function:

`MyFactorial = LAMBDA(n, IF(n = 0, 1, n * MyFactorial(n-1)))`

The IF function is used to test the termination condition. If the termination condition has not been reached, decrement the variable by 1 and call the functi
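The same recursive structure translates directly to Python, for readers who want to try the idea outside Excel:

```python
def my_factorial(n):
    # termination condition: 0! = 1
    if n == 0:
        return 1
    # recursive step: n! = n * (n-1)!
    return n * my_factorial(n - 1)
```

Just as you can check the LAMBDA version against Excel's FACT, you can check this one against Python's `math.factorial`.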

Linear Regression: Why should you reinvent Excel's LINEST?

In the previous article on Linear Regression, I mentioned Excel's LINEST function. But if you tried using the returned coefficients, you may have noticed something peculiar: the returned linear coefficients are in the reverse order of the input data. The LINEST documentation states:

The equation for the line is `y = m_1x_1 + m_2x_2 + ... + m_nx_n + b` if there are multiple ranges of x-values, where the dependent y-values are a function of the independent x-values. The m-values are coefficients corresponding to each x-value, and b is a constant value. Note that y, x, and m can be vectors. The array that the LINEST function returns is `{m_n, m_(n-1), ..., m_1, b}`.

The input is in the order 1st, 2nd, 3rd, ... but the returned coefficients are reversed. And if you were to use the coefficients to predict y for a given `x_1, x_2, x_3, ...`, you would have to swap either the x-s or the coefficients around. This isn't intuitive. For this reason you should reinvent LINEST. The inten
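To see the contrast, here is a least-squares fit in Python (numpy's `lstsq`, my choice for illustration, not Excel's algorithm) that returns the coefficients in the same order as the inputs, `m_1, ..., m_n, b`, so no swapping is needed at prediction time:

```python
import numpy as np

# four observations of two features, generated from y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

# append a column of ones so the last coefficient is the intercept b
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# coef holds [m_1, m_2, b] in input order -- predicting is just A_new @ coef
```

A reinvented LINEST that keeps this ordering makes prediction a single matrix product against the x-values as entered.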