Posts

Out-of-the-Box Part 2 - One Sample Mean Testing with Z.TEST

Previously we saw the One Sample Mean test implemented using LAMBDA by calculating the test statistic. Excel provides something similar with the Z.TEST function, which tests whether the average of a sample is statistically less than or equal to a test value. The formula is stated as: Z.TEST(array, x, [sigma]) where array is the range of data to be tested against x, x is the test value, and sigma is optional. If the population standard deviation is known, put it here; otherwise leave it empty and Excel will use the sample standard deviation instead. In effect you are testing whether the average of the array is less than or equal to x: `bar(array) <= x`

Comparing for Less Than or Equal To

Observe the box plot of 5 sample distributions above. We want to check if the distributions' averages are less than or equal to the value of 50. We see samples A and B below 50, sample C around 50, and samples D and E above 50. NOTE: Where possible do box plots of the…
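To make the call concrete, here is a minimal sketch, assuming a hypothetical layout with the sample in A2:A31 and a test value of 50 (neither is from the post):

```
P-value when the population sigma is unknown (sample standard deviation is used):
=Z.TEST(A2:A31, 50)

P-value with a known population standard deviation of 5:
=Z.TEST(A2:A31, 50, 5)
```

Z.TEST returns a one-tailed P-value, so a small result (say, below 0.05) is evidence against `bar(array) <= 50`.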

Out-of-the-Box Part 1 - Proportion Testing with CHISQ.TEST

So Excel LAMBDA is great! You can create custom functions that are not included in Excel, and we have seen in this blog how you could implement various statistical hypothesis tests. But then what are those Excel TEST functions for? What hypothesis testing can you do with these out-of-the-box TEST functions? In the next few posts, I will describe some common hypothesis tests you could do:

1. Proportion Test with CHISQ.TEST
2. One Sample Mean Test with Z.TEST
3. Two Sample Mean Test with T.TEST
4. Variance Test with F.TEST

Excel's Chi-Squared Test for One Proportion Testing

In Excel's documentation, CHISQ.TEST is described as a test for independence. It does this by performing a goodness-of-fit test on how well the data matches what is expected, which means we can also use it for one proportion testing. To use CHISQ.TEST we need to compare the sample proportion against the expected proportion. But unlike DC-Den's One Proportion Testing, where we only need to specify the expected…
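As a hedged sketch of how that comparison might sit on a worksheet (the counts, ranges, and expected proportion of 0.5 are all hypothetical, not from the post), suppose 62 successes in 100 trials:

```
Observed counts (B2:C2):   62           38
Expected counts (B3:C3):   =100*0.5     =100*(1-0.5)

P-value of the goodness-of-fit comparison:
=CHISQ.TEST(B2:C2, B3:C3)
```

A small P-value (say, below 0.05) suggests the sample proportion differs from the expected 0.5.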

Which Machine Learning Algorithm To Use?

Terminologies

We learnt a few machine learning terminologies and algorithms in this blog. Supervised means we rely on labelled training data; it is task driven, working towards an identified goal. Unsupervised means unlabelled training data; it is data driven, working to identify a pattern. Classification arranges data into classes/categories using a labelled dataset. Regression develops a model to predict continuous numerical values. Clustering separates an unlabelled dataset into clusters/groups of similar objects. Classification is a supervised learning algorithm, while clustering is an unsupervised one. Regression is considered supervised learning because the model is trained using both the input features and the output labels, which can be numerical values. I will mention here that two other unsupervised approaches are: Association, to identify underlying relationships, and Dimension Reduction, to reduce the number of dimensions/features to make calculations simpler. I did not cover any methods on a…

It's Probably Correct - Classifying with Naïve Bayes

Naïve Bayes is a categorical, probabilistic, supervised classification algorithm. You may already be familiar with the terminology supervised classification, so I will not repeat it here. Naïve Bayes doesn't require numerical values; it relies on categories or labels only. It is probabilistic because it uses probabilities, calculated from the given records, to decide the classification. The more consistent the data (repeatable patterns), the stronger the probability of the classification. Note: The implementation and example here follow closely from Learn Data Mining Through Excel by Hong Zhou. Great book!

Bayes Theorem

Naïve Bayes is based on Bayes' theorem, most famously written as: `P(y|x) = (P(x|y)*P(y)) / P(x)` where `P(y|x)` is the probability of `y` given `x`, `P(x|y)` is the probability of `x` given `y`, and `P(y)` and `P(x)` are the probabilities of `y` and `x` respectively. With multiple independent variables `(x_1, x_2, ..., x_n)` the equation would be `P(y|x_1, ..., x_n) = (P(y) * P(x_1|y) * ... * P(x_n|y)) / P(x_1, ..., x_n)`
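Because Naïve Bayes works purely on category counts, the probabilities can be read straight off the data with COUNTIF/COUNTIFS. A minimal sketch, assuming a hypothetical layout with two features in A2:B100, the class in C2:C100, and the category values ("Sunny", "Mild", "Yes") invented for illustration:

```
Prior P(y = "Yes"):
=COUNTIF(C2:C100, "Yes") / COUNTA(C2:C100)

Conditional P(x_1 = "Sunny" | y = "Yes"):
=COUNTIFS(A2:A100, "Sunny", C2:C100, "Yes") / COUNTIF(C2:C100, "Yes")

Score for class "Yes" (numerator only; P(x) is the same for every class,
so comparing numerators across classes is enough):
=(COUNTIF(C2:C100, "Yes") / COUNTA(C2:C100))
 * (COUNTIFS(A2:A100, "Sunny", C2:C100, "Yes") / COUNTIF(C2:C100, "Yes"))
 * (COUNTIFS(B2:B100, "Mild", C2:C100, "Yes") / COUNTIF(C2:C100, "Yes"))
```

Repeating the score formula for each class and picking the largest gives the predicted class.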

Who are your neighbours? Classification with KNN

The closer an observation is to a class, the more likely it belongs to that class. KNN, or K-Nearest Neighbour, is a non-parametric supervised learning classification algorithm. It uses proximity to make classifications about the grouping of an observation point. How do you measure proximity? You may remember Pythagoras' theorem from your school days: the distance between two points is the square root of the sum of the squares of the sides. This would be one way to define proximity, but we could define proximity differently. Given two points (`x_1`, `x_2`, `x_3`) and (`y_1`, `y_2`, `y_3`):

Euclidean distance: `sqrt((y_1-x_1)^2 + (y_2-x_2)^2 + (y_3-x_3)^2)`
Manhattan distance: `|y_1-x_1| + |y_2-x_2| + |y_3-x_3|`
Chebyshev distance: `max(|y_1-x_1|, |y_2-x_2|, |y_3-x_3|)`

Each of these definitions has its pros and cons. For our implementation, I will be using Euclidean distance. However, if you wish to reduce the computational complexity, you might want to try Manhattan or Chebyshev distance…
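All three distances are short enough to express as named LAMBDAs. A minimal sketch (the names are hypothetical, not the blog's dcrML functions), assuming each point is a single row of numeric cells of equal width:

```
EuclideanDist = LAMBDA(p, q, SQRT(SUMPRODUCT((p - q)^2)))
ManhattanDist = LAMBDA(p, q, SUMPRODUCT(ABS(p - q)))
ChebyshevDist = LAMBDA(p, q, MAX(ABS(p - q)))
```

With points in A2:C2 and A3:C3, `=EuclideanDist(A2:C2, A3:C3)` gives the straight-line distance; swapping in ManhattanDist or ChebyshevDist avoids the squaring and square root, which is where the computational saving comes from.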

Linear Discriminant Analysis LDA - Using linear regression for classification

Linear Discriminant Analysis (LDA) uses linear regression for supervised classification of data. Essentially you assign each class a numerical value, then use the linear regression method to calculate the projection of your observations onto the assigned numerical values. Finally you calculate the thresholds that distinguish between the classes. In short, LDA attempts to find the best linear function that separates your data points into distinct classes. The diagram above illustrates this idea.

Implementing LDA using LAMBDA

Fit

Steps in implementing LDA's Fit (a sketch of steps 1 and 2 follows the list):
1. Find the distinct classes and assign each an arbitrary value - UNIQUE and SEQUENCE.
2. Designate each observation with the arbitrarily assigned value depending on its class - XLOOKUP.
3. Find the linear regression coefficients for these observations - dcrML.Linear.Fit.
4. Project each observation onto the linear regression - dcrML.Linear.Predict.
5. Find the threshold of each class - classCutOff - from the spread of each regression…
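A hedged sketch of steps 1 and 2, assuming a hypothetical layout with the class labels in C2:C101:

```
Step 1 - distinct classes:          =UNIQUE(C2:C101)
Step 1 - arbitrary values 1..k:     =SEQUENCE(ROWS(UNIQUE(C2:C101)))

Step 2 - numeric target for every observation:
=XLOOKUP(C2:C101, UNIQUE(C2:C101), SEQUENCE(ROWS(UNIQUE(C2:C101))))
```

The spilled column from step 2 becomes the y-vector handed to dcrML.Linear.Fit in step 3.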

KMeans clustering - Finding your centre

KMeans clustering is a method to partition your observations into k clusters, with k centroids that describe the centre of each cluster. A new observation is part of the cluster whose centroid it is closest to. The diagram above illustrates the k-means clustering concept.

The KMeans approach starts by deciding the number of clusters you wish to have. Then you estimate where the centroid of each cluster might be located. The distance of each observation to each centroid is calculated, and each observation is re-clustered to the closest centroid. For each new cluster, we re-calculate the centroid by averaging the cluster's data by each feature. We repeat this cycle until no further refinement is achieved. Since Excel LAMBDA does not have iterative loops, a recursive approach will be used.

Implementing KMeans clustering in LAMBDA

With k-means clustering we implement Predict before Fit.

Predict

Predict takes a list of observations (array) and a list of centroids…
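A minimal sketch of the nearest-centroid assignment at the heart of Predict (NearestCentroid is a hypothetical name, not the blog's dcrML function; observations and centroids are assumed to be one row per point with matching feature columns):

```
NearestCentroid
= LAMBDA(obs, centroids,
    BYROW(obs,
      LAMBDA(p,
        LET(dists,
            BYROW(centroids, LAMBDA(c, SQRT(SUMPRODUCT((p - c)^2)))),
            XMATCH(MIN(dists), dists)))))
```

With observations in A2:C101 and three centroids in E2:G4, `=NearestCentroid(A2:C101, E2:G4)` would spill a column of cluster indices 1..3, one per observation. Fit then recurses: average each cluster per feature to get new centroids and reassign, until the assignments stop changing.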