Posts

Is Your Data Normal? The Shapiro-Wilk Test

Image
  Shapiro-Wilk and Anderson Darling Tests The Shapiro-Wilk test is another statistical test to assess the normality of a dataset. Just as Anderson-Darling test , both are goodness-of-fit test. Why would we need another normality test? Let me put some points forward: The Shapiro-Wilk SW test is specifically a test for normality. The Anderson-Darling AD test can be applied to other distributions, not just normal distribution. That said, my AD implementation  was specific for normal distribution. AD test is more sensitive to deviations in the tails of the distribution, while SW test is sensitive to deviations across the entire distribution. Both AD and SW can be used to calculate p-values, but the p-value will not be comparable. This means there can be instances where AD and SW disagree . From implementation perspective, I find AD test easier to write than the SW test (a lot more steps), but AD test was much more difficult to understand compared to SW test. So, which test should you use:

Is Your Data Normal? The Anderson-Darling Test

Image
Have a look at the Excel Histogram chart above based on a given set of data. Does it look normally distributed? How could you find out? Well, you could calculate the kurtosis and skewness, e.g. using  Descriptive Statistics , or generate a BoxPlot. But still, how can you know with certainty? Descriptive Statistics showing Kurtosis and Skewness BoxPlot of Sample Data Why is Checking for Normality Important? In previous blogs on One Sample Mean Test and Two Sample Mean Test , I assumed the given data is normally distributed. If the sample is normally distributed then choosing a parametric test like t-test is applicable since we are using the mean (central tendency) and standard deviation (spread) to calculate the t-statistics . Assuming normal distribution is a simplification. It is therefore only to justify using parametric tests. However, mean and standard deviation should not be used to describe a sample that is not normally distr

Extending Two Sample Mean Test for Arrays

Image
In the previous blog, I implemented a Two Sample Mean hypothesis test with Excel LAMBDA. In this blog, I will create a new function that takes in two sample data arrays directly, calculate the necessary statistics, and then reuse the Two Sample Mean function to perform the hypothesis test. This approach will ensure the same result. Parameter Changes In the earlier implementation, the dcrMean.Two.TTest input parameters are: dcrMean.Two.TTest =LAMBDA(sample_mean_1, sample_stdev_1, sample_size_1, sample_mean_2, sample_stdev_2, sample_size_2, [tail], [show_details], To take in two arrays, I will create a new function dcrMean.Two.TTest.Array with input parameters as follows: dcrMean.Two.TTest.Array =LAMBDA(array_1, array_2, [tail], [show_details], Within this new function, the mean and standard deviation of the two input arrays will be pre-calculated before passing the values into the hypothesis test. The implementation will be like this: =LAMBDA(array_1, array_2, [tail], [show_

Creating Two Sample Mean Test with LAMBDA

Image
You received glass window samples from two different vendors. You wish to check if the thickness of the samples are the same. Based on the samples, you obtain the following data: Vendor Sample Sample 1 Sample 2 Mean (cm) 2.0 2.3 Standard Deviation (cm) 0.2 0.3 Number of Samples 15 20 At first glance sample 2 is 0.3cm thicker than sample 1. But are their thickness significantly different? To make a comparison you would need a perform a Two Sample Mean test . Two Sample Mean Test A two sample mean test compares two sample distribution means against each other. It differs from the one sample mean test that compares against a target value. We write the Null Hypothesis as the means of sample 1 and of sample 2 are equal. `H_0: mu_1 = mu_2` And the Alternative Hypothesis as the mean of sample 1 and of sample 2 are not equal. `H_1: mu_1 != mu_2` We could also test

Extending One Sample Mean Test for Arrays

Image
In the previous blog, I implemented a One Sample Mean hypothesis test in Excel LAMBDA. When given an array of sample data, we firstly calculate the sample mean, standard deviation and size, before passing these into the function. The function compares a sample mean against an expected population mean. In this blog, we will create a new function to take the sample data array directly, calculate the necessary statistics, and then reuse the One Sample Mean function to perform the hypothesis test. This extension ensures the same result regardless of the approach used. Parameter Changes In the earlier One Sample Mean dcrMean.One.TTest the input parameters were defined as: dcrMean.One.TTest =LAMBDA(expected_mean, sample_mean, sample_stdev, sample_size, [tail], [show_details], To implement One Sample Mean test for array,  dcrMean.One.TTest.Array  input parameter will take the entire data array: dcrMean.One.TTest.Array =LAMBDA(expected_mean, array, [tail], [show_details], Within this

Excel Range vs Array

Image
As I worked more with LAMBDA, I discovered that Excel functions that take range parameters do not work with arrays . If a function's parameter mentions "range", you can be sure this will not work with arrays. Some examples of these functions are COUNTIF , COUNTIFS , AVERAGEIF , AVERAGEIFS , SUMIF , and SUMIFS . The other that I am aware of is RANK.AVG . What is an Array? A Range represents a cell, a row, a column, a selection of selections, or blocks of cells. Range will always have an address, for example B2:B5 is a range of values stored in a column. An Array is a set of values constructed using a formula. For example  ={1;2;3;4} is column of values. =VSTACK(1,2,3,4) is also a column of values. The FORMULATEXT function would return #N/A in a range cell that contains only values. On the other hand it returns the formula used to construct the array values. An array can be displayed on a spreadsheet. When displayed on the spreadsheet, the values can be referred by the

Creating One Sample Mean Test with LAMBDA

Image
Suppose you visited a village. The village elders claims to possess a secret elixir that makes their children grower taller! You are aware of the national 12-year old  average height is 150cm. And you could measure the village's 12-year old children's height. Plotting the village's children's height gives you a distribution like in the graph above. At first glance the village's children does seem taller than the national average. Is the village's children really taller? Is the distribution significantly different? One Sample Mean Test A one sample mean test compares a sample distribution's mean against a target value. It is similar to the one proportion test but the sample data is continuous, giving a normal distribution curve instead of a yes/no binary result. We write the Null Hypothesis as the sample mean equal to the population mean. `H_0: mu = mu_0` And the Alternative Hypothesis as the sample mean not equal to the population mean. `H_1: mu != mu_0` We