Posts

Out-of-the-Box Part 2 - One Sample Mean Testing with Z.TEST

Previously we saw the One Sample Mean test implemented using LAMBDA by calculating the test statistic. Excel provides something similar with the Z.TEST function, which tests whether the average of a sample is statistically less than or equal to a test value. The formula is stated as: Z.TEST(array, x, [sigma]) where array is the range of data to be tested against x, x is the test value, and sigma is optional. If the population standard deviation is known, put it here; otherwise leave it empty and Excel will use the sample standard deviation instead. In effect you are testing whether the average of the array is less than or equal to x: `bar(array) <= x`

Comparing for Less Than or Equal To

Observe the box plot of 5 sample distributions above. We want to check if the distributions' averages are less than or equal to the value of 50. We see samples A and B below 50, sample C around 50, and samples D and E above 50. NOTE: Where possible do box plots of the…
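To make the call concrete, here is a minimal sketch, assuming a hypothetical layout with the sample in A2:A31 and a test value of 50 (neither is from the post):

```
P-value when the population sigma is unknown (sample standard deviation is used):
=Z.TEST(A2:A31, 50)

P-value with a known population standard deviation of 5:
=Z.TEST(A2:A31, 50, 5)
```

Z.TEST returns a one-tailed P-value, so a small result (say, below 0.05) is evidence against `bar(array) <= 50`.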

Out-of-the-Box Part 1 - Proportion Testing with CHISQ.TEST

So Excel LAMBDA is great! You can create custom functions that are not included in Excel, and we have seen in this blog how you could implement various statistical hypothesis tests. But then what are those Excel TEST functions for? What hypothesis testing can you do with these out-of-the-box TEST functions? In the next few posts, I will describe some common hypothesis tests you could do:

1. Proportion Test with CHISQ.TEST
2. One Sample Mean Test with Z.TEST
3. Two Sample Mean Test with T.TEST
4. Variance Test with F.TEST

Excel's Chi-Squared Test for One Proportion Testing

In Excel's documentation, CHISQ.TEST is described as a test for independence. It does this by performing a goodness-of-fit test on how well the data matches what is expected, which means we can also use it for one proportion testing. To use CHISQ.TEST we need to compare the sample proportion against the expected proportion. But unlike DC-Den's One Proportion Testing, where we only need to specify the expected…
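As a hedged sketch of how that comparison might sit on a worksheet (the counts, ranges, and expected proportion of 0.5 are all hypothetical, not from the post), suppose 62 successes in 100 trials:

```
Observed counts (B2:C2):   62           38
Expected counts (B3:C3):   =100*0.5     =100*(1-0.5)

P-value of the goodness-of-fit comparison:
=CHISQ.TEST(B2:C2, B3:C3)
```

A small P-value (say, below 0.05) suggests the sample proportion differs from the expected 0.5.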

Which Machine Learning Algorithm To Use?

Terminologies

We learnt a few machine learning terminologies and algorithms in this blog. Supervised means we rely on labelled training data; it is task driven, working towards an identified goal. Unsupervised means unlabelled training data; it is data driven, working to identify a pattern. Classification arranges data into classes/categories using a labelled dataset. Regression develops a model to predict continuous numerical values. Clustering separates an unlabelled dataset into clusters/groups of similar objects. Classification is a supervised learning algorithm, while clustering is an unsupervised one. Regression is considered supervised learning because the model is trained using both the input features and the output labels, which can be numerical values. I will mention here that two other unsupervised approaches are: Association, to identify underlying relationships, and Dimension Reduction, to reduce the number of dimensions/features to make calculations simpler. I did not cover any methods on a…

It's Probably Correct - Classifying with Naïve Bayes

Naïve Bayes is a categorical, probabilistic, supervised classification algorithm. You may already be familiar with the terminology supervised classification, so I will not repeat it here. Naïve Bayes doesn't require numerical values; it relies on categories or labels only. It is probabilistic because it uses probabilities, calculated from the given records, to decide the classification. The more consistent the data (repeatable patterns), the stronger the probability of the classification. Note: The implementation and example here follow closely from Learn Data Mining Through Excel by Hong Zhou. Great book!

Bayes Theorem

Naïve Bayes is based on Bayes' theorem, most famously written as: `P(y|x) = (P(x|y)*P(y)) / P(x)` where `P(y|x)` is the probability of `y` given `x`, `P(x|y)` is the probability of `x` given `y`, and `P(y)` and `P(x)` are the probabilities of `y` and `x` respectively. With multiple independent variables `(x_1, x_2, ..., x_n)` the equation would be `P(y|x_1, ..., x_n) = (P(y) * P(x_1|y) * ... * P(x_n|y)) / P(x_1, ..., x_n)`
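Because Naïve Bayes works purely on category counts, the probabilities can be read straight off the data with COUNTIF/COUNTIFS. A minimal sketch, assuming a hypothetical layout with two features in A2:B100, the class in C2:C100, and the category values ("Sunny", "Mild", "Yes") invented for illustration:

```
Prior P(y = "Yes"):
=COUNTIF(C2:C100, "Yes") / COUNTA(C2:C100)

Conditional P(x_1 = "Sunny" | y = "Yes"):
=COUNTIFS(A2:A100, "Sunny", C2:C100, "Yes") / COUNTIF(C2:C100, "Yes")

Score for class "Yes" (numerator only; P(x) is the same for every class,
so comparing numerators across classes is enough):
=(COUNTIF(C2:C100, "Yes") / COUNTA(C2:C100))
 * (COUNTIFS(A2:A100, "Sunny", C2:C100, "Yes") / COUNTIF(C2:C100, "Yes"))
 * (COUNTIFS(B2:B100, "Mild", C2:C100, "Yes") / COUNTIF(C2:C100, "Yes"))
```

Repeating the score formula for each class and picking the largest gives the predicted class.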

Who are your neighbours? Classification with KNN

The closer an observation is to a class, the more likely it belongs to that class. KNN, or K-Nearest Neighbour, is a non-parametric supervised learning classification algorithm. It uses proximity to make classifications about the grouping of an observation point. How do you measure proximity? You may remember Pythagoras' theorem from your school days: the distance between two points is the square root of the sum of the squares of the sides. This would be one way to define proximity, but we could define proximity differently. Given two points (`x_1`, `x_2`, `x_3`) and (`y_1`, `y_2`, `y_3`):

Euclidean distance: `sqrt((y_1-x_1)^2 + (y_2-x_2)^2 + (y_3-x_3)^2)`
Manhattan distance: `|y_1-x_1| + |y_2-x_2| + |y_3-x_3|`
Chebyshev distance: `max(|y_1-x_1|, |y_2-x_2|, |y_3-x_3|)`

Each of these definitions has its pros and cons. For our implementation, I will be using Euclidean distance. However, if you wish to reduce the computational complexity, you might want to try Manhattan or Chebyshev distance…
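All three distances are short enough to express as named LAMBDAs. A minimal sketch (the names are hypothetical, not the blog's dcrML functions), assuming each point is a single row of numeric cells of equal width:

```
EuclideanDist = LAMBDA(p, q, SQRT(SUMPRODUCT((p - q)^2)))
ManhattanDist = LAMBDA(p, q, SUMPRODUCT(ABS(p - q)))
ChebyshevDist = LAMBDA(p, q, MAX(ABS(p - q)))
```

With points in A2:C2 and A3:C3, `=EuclideanDist(A2:C2, A3:C3)` gives the straight-line distance; swapping in ManhattanDist or ChebyshevDist avoids the squaring and square root, which is where the computational saving comes from.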

Linear Discriminant Analysis LDA - Using linear regression for classification

Linear Discriminant Analysis (LDA) uses linear regression for supervised classification of data. Essentially you assign each class a numerical value, then use the linear regression method to calculate the projection of your observations onto the assigned numerical values. Finally you calculate the thresholds that distinguish between the classes. In short, LDA attempts to find the best linear function that separates your data points into distinct classes. The diagram above illustrates this idea.

Implementing LDA using LAMBDA

Fit

Steps in implementing LDA's Fit (a sketch of steps 1 and 2 follows the list):
1. Find the distinct classes and assign each an arbitrary value - UNIQUE and SEQUENCE.
2. Designate each observation with the arbitrarily assigned value depending on its class - XLOOKUP.
3. Find the linear regression coefficients for these observations - dcrML.Linear.Fit.
4. Project each observation onto the linear regression - dcrML.Linear.Predict.
5. Find the threshold of each class - classCutOff - from the spread of each regression…
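A hedged sketch of steps 1 and 2, assuming a hypothetical layout with the class labels in C2:C101:

```
Step 1 - distinct classes:          =UNIQUE(C2:C101)
Step 1 - arbitrary values 1..k:     =SEQUENCE(ROWS(UNIQUE(C2:C101)))

Step 2 - numeric target for every observation:
=XLOOKUP(C2:C101, UNIQUE(C2:C101), SEQUENCE(ROWS(UNIQUE(C2:C101))))
```

The spilled column from step 2 becomes the y-vector handed to dcrML.Linear.Fit in step 3.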

KMeans clustering - Finding your centre

KMeans clustering is a method to partition your observations into k clusters, with k centroids that describe the centre of each cluster. A new observation is part of the cluster whose centroid it is closest to. The diagram above illustrates the k-means clustering concept.

The KMeans approach starts by deciding the number of clusters you wish to have. Then you estimate where the centroid of each cluster might be located. The distance of each observation to each centroid is calculated, and each observation is re-clustered to the closest centroid. For each new cluster, we re-calculate the centroid by averaging the cluster's data by each feature. We repeat this cycle until no further refinement is achieved. Since Excel LAMBDA does not have iterative loops, a recursive approach will be used.

Implementing KMeans clustering in LAMBDA

With k-means clustering we implement Predict before Fit.

Predict

Predict takes a list of observations (array) and a list of centroids…
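A minimal sketch of the nearest-centroid assignment at the heart of Predict (NearestCentroid is a hypothetical name, not the blog's dcrML function; observations and centroids are assumed to be one row per point with matching feature columns):

```
NearestCentroid
= LAMBDA(obs, centroids,
    BYROW(obs,
      LAMBDA(p,
        LET(dists,
            BYROW(centroids, LAMBDA(c, SQRT(SUMPRODUCT((p - c)^2)))),
            XMATCH(MIN(dists), dists)))))
```

With observations in A2:C101 and three centroids in E2:G4, `=NearestCentroid(A2:C101, E2:G4)` would spill a column of cluster indices 1..3, one per observation. Fit then recurses: average each cluster per feature to get new centroids and reassign, until the assignments stop changing.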