Posts

ANOVA - Comparing averages of multiple data sets

In an earlier post, we used the Two Sample Mean Test to compare the means of two samples (duh). If you have 3 samples, you could use the same test and compare pairs of samples, i.e. sample 1 & sample 2, sample 2 & sample 3, and sample 1 & sample 3. The problem arises when you have many more samples. Running this test means keeping track of every pairing: k samples need k(k-1)/2 pairwise tests. This approach is wastefully repetitive and you are likely to make mistakes. Also, sometimes you only want to know if at least one sample is different from the rest. Scientific or medical breakthroughs are rare. In most experiments and trials, there will be no change or difference. A single test that tells you there is no difference between the samples allows you to move on.

Is there a better method to compare the means of multiple samples?

Enter ANOVA

Analysis Of Variance (ANOVA) is a statistical method used to compare the means between samples. It tells you if there's any significant d
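To make the idea concrete, here is a minimal Python sketch of the textbook one-way ANOVA F-statistic, next to a count of how many pairwise tests the alternative would need. It is not the implementation from the full post; the function names `pairwise_test_count` and `one_way_anova_f` are made up for illustration, and turning F into a p-value (which needs the F-distribution) is left out.

```python
from math import comb
from statistics import mean


def pairwise_test_count(k):
    """Number of two-sample tests needed to compare every pair of k samples."""
    return comb(k, 2)


def one_way_anova_f(*groups):
    """Textbook one-way ANOVA F-statistic (hypothetical helper, no p-value)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


if __name__ == "__main__":
    print(pairwise_test_count(6))
    print(one_way_anova_f([1, 2, 3], [2, 3, 4], [3, 4, 5]))
```

A large F means the group means differ by more than the scatter within the groups would explain; a single F replaces the whole set of pairwise tests when all you need to know is whether anything differs.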

Python in Excel coming soon! Will you be onboarding?

If you have not heard, Microsoft announced that it will be releasing Python in Excel. Python commands and libraries will be accessible from within Excel, with no installation or add-ons required. The various Python libraries, including charting capabilities, will be available to you. Python commands will be executed in the Microsoft Cloud. Security will be ensured. The screenshot below shows Python commands entered from within an Excel cell, with Python referencing data from the worksheet, almost as if Python is native in Excel. Advanced visualisation will be available. Machine learning capabilities will be at your fingertips. Will it benefit you? Short answer: No. The majority of Excel usage would not need these Python libraries and charts. It will not entice non-Python users over to Python. Any workbook created with Python scripts would still require a Python user to maintain changes. So in a company, an employee who introduces Python in Excel would not get a warm reception. Any workbook creat

When it is not Normal... The Mann-Whitney Test

The Mann-Whitney Test

The Mann-Whitney test helps you compare two sets of data when they are not normally distributed. I would use the Mann-Whitney test only after I confirm non-normality using the Anderson-Darling test. I should add that I run a box-plot first before running the AD test. I used to think that Mann-Whitney compares the medians of two data sets. But in the process of implementing Mann-Whitney in Excel LAMBDA, I found out that the Mann-Whitney test compares the mean ranks and, in doing so, determines if the two data sets came from the same population.

Step 1: Group the two samples together and sort the data in ascending order, but retain their sample origin.
Step 2: Rank the combined data.
Step 3: Separate the data back into the two samples. Sum the ranks for each sample.
Step 4: Compute the test statistic U: `U_1 = R_1 - (n_1(n_1+1))/2` and `U_2 = R_2 - (n_2(n_2+1))/2`
Step 5: Choose the smaller U to calculate the equivalent z-statistic: `z = (U - m_U) / sigma_U` where `m_U` and `sigma_U` are the
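The five steps can be sketched in Python as below. This is only an illustration, not the Excel LAMBDA from the post: the function name `mann_whitney_u` is hypothetical, tied values are given average ranks, and `m_U` and `sigma_U` are assumed to be the usual normal-approximation mean `n_1 n_2 / 2` and standard deviation `sqrt(n_1 n_2 (n_1 + n_2 + 1) / 12)`, with no tie or continuity correction.

```python
from math import sqrt


def mann_whitney_u(sample_1, sample_2):
    """Return (U1, U2, z) following the five steps; a sketch, not a full test."""
    # Step 1: pool the samples, remember the origin, sort ascending
    pooled = sorted([(x, 1) for x in sample_1] + [(x, 2) for x in sample_2])

    # Step 2: rank the pooled data (1-based), averaging ranks across ties
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    # Step 3: split the ranks back by origin and sum them
    r1 = sum(r for r, (_, origin) in zip(ranks, pooled) if origin == 1)
    r2 = sum(r for r, (_, origin) in zip(ranks, pooled) if origin == 2)

    # Step 4: U statistics
    n1, n2 = len(sample_1), len(sample_2)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2

    # Step 5: z from the smaller U (assumed normal approximation)
    u = min(u1, u2)
    m_u = n1 * n2 / 2
    sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - m_u) / sigma_u
    return u1, u2, z


if __name__ == "__main__":
    print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

A handy sanity check is that `U_1 + U_2` always equals `n_1 * n_2`, whatever the data.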

Is Your Data Normal? The Shapiro-Wilk Test

Shapiro-Wilk and Anderson-Darling Tests

The Shapiro-Wilk test is another statistical test to assess the normality of a dataset. Like the Anderson-Darling test, it is a goodness-of-fit test. Why would we need another normality test? Let me put some points forward:

- The Shapiro-Wilk (SW) test is specifically a test for normality. The Anderson-Darling (AD) test can be applied to other distributions, not just the normal distribution. That said, my AD implementation was specific to the normal distribution.
- The AD test is more sensitive to deviations in the tails of the distribution, while the SW test is sensitive to deviations across the entire distribution.
- Both AD and SW can be used to calculate p-values, but the p-values are not directly comparable. This means there can be instances where AD and SW disagree.
- From an implementation perspective, I found the AD test easier to write than the SW test (which has a lot more steps), but the AD test was much more difficult to understand than the SW test.

So, which test should you use:

Is Your Data Normal? The Anderson-Darling Test

Image
Have a look at the Excel Histogram chart above based on a given set of data. Does it look normally distributed? How could you find out? Well, you could calculate the kurtosis and skewness, e.g. using Descriptive Statistics, or generate a BoxPlot. But still, how can you know with certainty?

[Figure: Descriptive Statistics showing Kurtosis and Skewness]
[Figure: BoxPlot of Sample Data]

Why is Checking for Normality Important?

In previous blogs on One Sample Mean Test and Two Sample Mean Test, I assumed the given data is normally distributed. If the sample is normally distributed, then a parametric test like the t-test is applicable, since we are using the mean (central tendency) and standard deviation (spread) to calculate the t-statistic. Assuming a normal distribution is a simplification, made only to justify using parametric tests. However, mean and standard deviation should not be used to describe a sample that is not normally distr
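For a quick feel of what skewness and kurtosis measure, here is a minimal Python sketch using the plain moment definitions. The function name `moment_skew_kurtosis` is hypothetical. Note that Excel's SKEW and KURT apply small-sample corrections, so their values will differ somewhat from these, especially for small samples.

```python
def moment_skew_kurtosis(data):
    """Return (skewness, excess kurtosis) from plain central moments.

    Uncorrected moment definitions, assumed here for illustration only;
    Excel's SKEW and KURT use bias-corrected versions.
    """
    n = len(data)
    centre = sum(data) / n
    m2 = sum((x - centre) ** 2 for x in data) / n  # variance (population)
    m3 = sum((x - centre) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - centre) ** 4 for x in data) / n  # fourth central moment

    skewness = m3 / m2 ** 1.5           # 0 for a symmetric sample
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis


if __name__ == "__main__":
    print(moment_skew_kurtosis([1, 2, 3, 4, 5]))
```

Values near zero for both are consistent with normality but do not prove it, which is why a formal test is still needed.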

Extending Two Sample Mean Test for Arrays

In the previous blog, I implemented a Two Sample Mean hypothesis test with Excel LAMBDA. In this blog, I will create a new function that takes in two sample data arrays directly, calculates the necessary statistics, and then reuses the Two Sample Mean function to perform the hypothesis test. Reusing the same function ensures both versions give the same result.

Parameter Changes

In the earlier implementation, the dcrMean.Two.TTest input parameters are:

dcrMean.Two.TTest =LAMBDA(sample_mean_1, sample_stdev_1, sample_size_1, sample_mean_2, sample_stdev_2, sample_size_2, [tail], [show_details],

To take in two arrays, I will create a new function dcrMean.Two.TTest.Array with input parameters as follows:

dcrMean.Two.TTest.Array =LAMBDA(array_1, array_2, [tail], [show_details],

Within this new function, the mean and standard deviation of the two input arrays will be pre-calculated before passing the values into the hypothesis test. The implementation will be like this:

=LAMBDA(array_1, array_2, [tail], [show_
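The same wrapper pattern can be sketched in Python, purely to show the shape of the idea rather than the LAMBDA itself. Both function names are hypothetical, the tail and show_details options are left out, and the inner function is assumed to return only an unequal-variance (Welch) t-statistic as a stand-in for the full test.

```python
from math import sqrt
from statistics import mean, stdev


def t_test_from_stats(mean_1, stdev_1, size_1, mean_2, stdev_2, size_2):
    """Stand-in for the summary-statistics test: returns a Welch t-statistic."""
    standard_error = sqrt(stdev_1 ** 2 / size_1 + stdev_2 ** 2 / size_2)
    return (mean_1 - mean_2) / standard_error


def t_test_from_arrays(array_1, array_2):
    """Pre-calculate the statistics, then delegate to the existing function."""
    return t_test_from_stats(
        mean(array_1), stdev(array_1), len(array_1),
        mean(array_2), stdev(array_2), len(array_2),
    )


if __name__ == "__main__":
    print(t_test_from_arrays([1, 2, 3], [2, 3, 4]))
```

Because the array version only prepares inputs and delegates, any fix made to the summary-statistics function carries over automatically.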

Creating Two Sample Mean Test with LAMBDA

You received glass window samples from two different vendors. You wish to check if the thickness of the samples is the same. Based on the samples, you obtain the following data:

| Vendor Sample           | Sample 1 | Sample 2 |
| Mean (cm)               | 2.0      | 2.3      |
| Standard Deviation (cm) | 0.2      | 0.3      |
| Number of Samples       | 15       | 20       |

At first glance, sample 2 is 0.3 cm thicker than sample 1. But are their thicknesses significantly different? To make a comparison, you would need to perform a Two Sample Mean test.

Two Sample Mean Test

A two sample mean test compares two sample distribution means against each other. It differs from the one sample mean test, which compares against a target value. We write the Null Hypothesis as the means of sample 1 and of sample 2 are equal. `H_0: mu_1 = mu_2` And the Alternative Hypothesis as the means of sample 1 and of sample 2 are not equal. `H_1: mu_1 != mu_2` We could also test
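As a rough worked example with the vendor data, the t-statistic can be computed by hand. The excerpt does not say which variant of the test is used, so this assumes the unequal-variance (Welch) form; a pooled-variance test would give a slightly different value.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
  = \frac{2.0 - 2.3}{\sqrt{\dfrac{0.2^2}{15} + \dfrac{0.3^2}{20}}}
  = \frac{-0.3}{\sqrt{0.002667 + 0.0045}}
  \approx \frac{-0.3}{0.0847}
  \approx -3.54
```

Whether that counts as significant depends on the degrees of freedom and on the tail option chosen for the test.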