Is Your Data Normal? The Anderson-Darling Test
Have a look at the Excel histogram above, built from a given set of data. Does it look normally distributed? How could you find out?

(Figures: descriptive statistics showing kurtosis and skewness; box plot of the sample data.)
Why is Checking for Normality Important?
In previous blogs on the One Sample Mean Test and Two Sample Mean Test, I assumed the given data was normally distributed.
If the sample is normally distributed, then a parametric test like the t-test is applicable, since we use the mean (central tendency) and standard deviation (spread) to calculate the t-statistic. Assuming normality is a simplification, but it is what justifies using a parametric test in the first place.
However, the mean and standard deviation should not be used to describe a sample whose distribution (shape) is not known to be normal. A parametric test like the t-test is then not applicable; instead, a non-parametric test is needed to compare the data. I will cover the Mann-Whitney and Kruskal-Wallis tests in future blogs.
But for now, how do we check whether a t-test is appropriate for a given set of data? In other words, how do we check if the data is normally distributed?
Enter Anderson-Darling Normality Test
The Anderson-Darling normality test is a commonly used goodness-of-fit test for checking whether a set of data is normally distributed. Unfortunately, it is not built into Excel.
The Anderson-Darling test measures the discrepancy between the observed data and the expected cumulative distribution function of a normal distribution. The test statistic is based on the squared difference between the observed and expected values, giving more weight to discrepancies in the tails of the distribution. In simpler terms: does the observed data come from a normal distribution?
The test statistic `A` is used to assess whether the ordered data `{Y_1 < ... < Y_n}` comes from the normal cumulative distribution function (CDF) `F`:
`A^2 = -n - S`
where
`S = sum_(i=1)^n (2i-1)/n [ln(F(Y_i)) + ln(1 - F(Y_(n+1-i)))]`
`n =` size of data
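To make the formula concrete before the Excel implementation, here is a minimal stdlib-Python sketch of `A^2` (the names `normal_cdf` and `ad_statistic` are my own, not part of the Excel implementation; `math.erf` supplies the standard normal CDF):

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function (stdlib only)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ad_statistic(data):
    # A^2 = -n - S, with S = sum (2i-1)/n [ln F(Y_i) + ln(1 - F(Y_(n+1-i)))]
    n = len(data)
    mean = sum(data) / n
    stdev = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample stdev
    y = sorted(data)
    f = [normal_cdf((v - mean) / stdev) for v in y]  # F(Y_i), 0-based list
    s = sum((2 * i - 1) / n * (math.log(f[i - 1]) + math.log(1 - f[n - i]))
            for i in range(1, n + 1))
    return -n - s
```

Note how `F(Y_(n+1-i))` becomes `f[n - i]` with 0-based indexing; this is the same "reversed" array trick the LAMBDA uses later.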
The test statistic `A` is then used to calculate the p-value based on the following table. As I understand it, the values in the table are based on the theoretical normal cumulative distribution function.

| `A^**` | p-value |
| --- | --- |
| `A^** <= 0.2` | `1 - exp[-13.436 + 101.14A^** - 223.73(A^**)^2]` |
| `0.2 < A^** <= 0.34` | `1 - exp[-8.318 + 42.796A^** - 59.938(A^**)^2]` |
| `0.34 < A^** < 0.6` | `exp[0.9177 - 4.279A^** - 1.38(A^**)^2]` |
| `A^** >= 0.6` | `exp[1.2937 - 5.709A^** - 0.0186(A^**)^2]` |
where
`A^** = A (1 + 0.75/n + 2.25/n^2)`
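The table translates directly into a piecewise function. Here is a hedged Python sketch of that lookup (the name `ad_pvalue` is my own), applying the small-sample adjustment and then the appropriate formula:

```python
import math

def ad_pvalue(a2, n):
    # Small-sample adjustment, then the piecewise p-value table
    a_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    if a_star <= 0.2:
        return 1 - math.exp(-13.436 + 101.14 * a_star - 223.73 * a_star ** 2)
    if a_star <= 0.34:
        return 1 - math.exp(-8.318 + 42.796 * a_star - 59.938 * a_star ** 2)
    if a_star < 0.6:
        return math.exp(0.9177 - 4.279 * a_star - 1.38 * a_star ** 2)
    return math.exp(1.2937 - 5.709 * a_star - 0.0186 * a_star ** 2)
```

A small `A^**` (data close to normal) yields a p-value near 1; a large `A^**` yields a p-value near 0.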
The mathematics behind the AD test statistic is beyond the scope of this article. Even taking it at face value, there are still a lot of equations and conditions, and this will be the most complex Excel LAMBDA implementation yet. But it is not complicated.
A side note: There's a big difference between complexity and complication.
Intermediate Steps
To start off, I calculate the mean, standard deviation, and size of the data. I then sort the data and create an increasing sequence to pre-calculate `2i - 1` for each position.

```
mean, AVERAGE(array),
stdev, STDEV.S(array),
size, COUNT(array),
array_X, FILTER(TOCOL(array), TOCOL(array)<>""),
array_Ysorted, SORT(array_X,,1),
seq, SEQUENCE(size,1,1,1),
seq_2i_minus_1, 2*seq - 1,
```
For each term, I calculate the normal cumulative distribution terms, and then the intermediate terms. All this is leading up to the `S` sum function.
```
CDF_Yi, NORM.S.DIST((array_Ysorted-mean)/stdev,TRUE),
One_minus_CDF_Yi, 1 - CDF_Yi,
One_minus_CDF_Yi_reversed, INDEX(One_minus_CDF_Yi, SEQUENCE(size, 1, size, -1)),
ln_F1_plus_F2, LN(CDF_Yi) + LN(One_minus_CDF_Yi_reversed),
Si_term, seq_2i_minus_1 * ln_F1_plus_F2,
```
With these terms calculated, we can now calculate the `S` sum function, `A` value, and `A^**` value.
```
Si_sum, SUM(Si_term),
AD_value, -size - Si_sum/size,
AD_star, AD_value * (1 + 0.75/size + 2.25/size^2),
```
Depending on the `A^**` value, we look up the table for the correct p-value equation to use.
Putting It Together
Congratulations on getting here. We are done with the heavy lifting. Below is the final implementation of the Anderson-Darling normality test as an Excel LAMBDA.
```
dcrNormality.AndersonDarling.Test =LAMBDA(array, [show_details],
    LET(
        show_details, IF(ISOMITTED(show_details), FALSE, show_details),
        mean, AVERAGE(array),
        stdev, STDEV.S(array),
        size, COUNT(array),
        array_X, FILTER(TOCOL(array), TOCOL(array)<>""),
        array_Ysorted, SORT(array_X,,1),
        seq, SEQUENCE(size,1,1,1),
        seq_2i_minus_1, 2*seq - 1,
        CDF_Yi, NORM.S.DIST((array_Ysorted-mean)/stdev,TRUE),
        One_minus_CDF_Yi, 1 - CDF_Yi,
        One_minus_CDF_Yi_reversed, INDEX(One_minus_CDF_Yi, SEQUENCE(size, 1, size, -1)),
        ln_F1_plus_F2, LN(CDF_Yi) + LN(One_minus_CDF_Yi_reversed),
        Si_term, seq_2i_minus_1 * ln_F1_plus_F2,
        Si_sum, SUM(Si_term),
        AD_value, -size - Si_sum/size,
        AD_star, AD_value * (1 + 0.75/size + 2.25/size^2),
        pvalue1, 1 - EXP(-13.436 + 101.14*AD_star - 223.73*AD_star^2),
        pvalue2, 1 - EXP(-8.318 + 42.796*AD_star - 59.938*AD_star^2),
        pvalue3, EXP(0.9177 - 4.279*AD_star - 1.38*AD_star^2),
        pvalue4, EXP(1.2937 - 5.709*AD_star - 0.0186*AD_star^2),
        pvalue, IF(AD_star <= 0.2, pvalue1,
                IF(AD_star <= 0.34, pvalue2,
                IF(AD_star < 0.6, pvalue3, pvalue4))),
        AD_summary, VSTACK(
            HSTACK("Si_sum", Si_sum),
            HSTACK("AD_value", AD_value),
            HSTACK("AD_star", AD_star)
        ),
        pvalues_summary, VSTACK(
            HSTACK("pvalue1", pvalue1),
            HSTACK("pvalue2", pvalue2),
            HSTACK("pvalue3", pvalue3),
            HSTACK("pvalue4", pvalue4)
        ),
        details, VSTACK(
            HSTACK("X","Ysorted","i","2i-1","F1=F(Yi)","1-F(Yi)","F2=1-F(Yi)reversed",
                   "ln(F1)+ln(F2)","Si_terms","AD statistics","","pvalues_summary","","pvalue"),
            HSTACK(array_X, array_Ysorted, seq, seq_2i_minus_1, CDF_Yi,
                   One_minus_CDF_Yi, One_minus_CDF_Yi_reversed, ln_F1_plus_F2,
                   Si_term, AD_summary, pvalues_summary, pvalue)
        ),
        IF(show_details, IFNA(details,""), pvalue)
    )
)
```
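If you want to cross-check the LAMBDA's result outside Excel, the whole pipeline fits in one stdlib-Python sketch (the function name `anderson_darling_pvalue` is my own; it mirrors the LET steps above, from standardizing and sorting through the piecewise p-value table):

```python
import math

def anderson_darling_pvalue(data):
    # Mirrors the LAMBDA: A^2, then A** adjustment, then the p-value table
    n = len(data)
    mean = sum(data) / n
    stdev = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # STDEV.S
    # F(Y_i) for the sorted, standardized data (NORM.S.DIST equivalent)
    f = [0.5 * (1 + math.erf((v - mean) / (stdev * math.sqrt(2))))
         for v in sorted(data)]
    # S = sum (2i-1)/n [ln F(Y_i) + ln(1 - F(Y_(n+1-i)))]
    s = sum((2 * i - 1) / n * (math.log(f[i - 1]) + math.log(1 - f[n - i]))
            for i in range(1, n + 1))
    a2 = -n - s                                   # AD_value
    a_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)  # AD_star
    if a_star <= 0.2:
        return 1 - math.exp(-13.436 + 101.14 * a_star - 223.73 * a_star ** 2)
    if a_star <= 0.34:
        return 1 - math.exp(-8.318 + 42.796 * a_star - 59.938 * a_star ** 2)
    if a_star < 0.6:
        return math.exp(0.9177 - 4.279 * a_star - 1.38 * a_star ** 2)
    return math.exp(1.2937 - 5.709 * a_star - 0.0186 * a_star ** 2)
```

Running it on a roughly bell-shaped sample should give a large p-value (no evidence against normality), while a heavily skewed sample should give a much smaller one.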
Seeing the Anderson-Darling LAMBDA Implementation in Action
- X = input data, reprinted
- Ysorted = sorted input data
- i, 2i-1 = sequence
- F1 = F(Yi) = first ln term
- F2 = second ln term
- Si = each term in S
- Si_sum = sum of all the Si terms
- pvalue1..pvalue4 = the four candidate p-values, all pre-calculated
- pvalue = final pvalue returned
Final Words
Phew! It is not as complicated as you might at first imagine. The tricky part is understanding the steps; once you take the equations at face value, the implementation follows.
But you may ask: if Anderson-Darling is a goodness-of-fit test, why not use Excel's built-in chi-square functions instead? Good question. The reason I implemented the Anderson-Darling test is simply that I am in the habit of using it. No special reason. Maybe in the future I will try packaging Excel's chi-square test into a nice LAMBDA function.
In the next blog, we will look at the Shapiro-Wilk normality test. It has more steps than AD, but I assure you it isn't complicated either. With your new Anderson-Darling normality test, go check if your data is normal in DC-DEN!