Is Your Data Normal? The Shapiro-Wilk Test

 

Shapiro-Wilk and Anderson-Darling Tests

The Shapiro-Wilk test is another statistical test for assessing the normality of a dataset. Like the Anderson-Darling test, it is a goodness-of-fit test.

Why would we need another normality test? Let me put some points forward:

  • The Shapiro-Wilk (SW) test is specifically a test for normality. The Anderson-Darling (AD) test can be applied to other distributions, not just the normal distribution. That said, my AD implementation was specific to the normal distribution.
  • The AD test is more sensitive to deviations in the tails of the distribution, while the SW test is sensitive to deviations across the entire distribution.
  • Both AD and SW produce p-values, but the two p-values are not directly comparable. This means there can be instances where AD and SW disagree.
  • From an implementation perspective, I find the AD test easier to write than the SW test (which has a lot more steps), but the AD test was much more difficult to understand than the SW test.

So, which test should you use: Anderson-Darling or Shapiro-Wilk? I use the Anderson-Darling test out of habit; there is no specific reason. I have read online that some people prefer to run more than one test just to be certain, and others have suggested that there is no harm in doing so.

Side note: AD and SW are not the only normality tests available. If you are testing for other types of distributions, learning AD is useful because you will be able to implement it for those distributions as well.

Shapiro-Wilk Test

I followed the explanation of the Shapiro-Wilk test on Real-Statistics.

Step 1: Sort the data in ascending order such that `x_1 <= x_2 <= ... <= x_n`

Step 2: Define the values `m_1, m_2, ..., m_n` by

`m_i = NORM.S.INV((i-0.375)/(n+0.25))`

Step 3: Calculate the sum of squares

`m = sum_(i=1)^n (m_i)^2`
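Steps 2 and 3 can be cross-checked outside Excel. The following is a minimal Python sketch, assuming `statistics.NormalDist().inv_cdf` as a stand-in for `NORM.S.INV`. The function name `normal_scores` is my own and is not part of the LAMBDA.

```python
from statistics import NormalDist


def normal_scores(n):
    """Return the values m_1..m_n (Step 2) and their sum of squares m (Step 3)."""
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    scores = [inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    sum_sq = sum(s * s for s in scores)
    return scores, sum_sq


scores, sum_sq = normal_scores(12)
print([round(s, 4) for s in scores])  # ascending and symmetric about zero
print(round(sum_sq, 4))
```

Because the probabilities `(i-0.375)/(n+0.25)` are symmetric about 0.5, the scores satisfy `m_i = -m_(n+1-i)`, which is a quick sanity check on any implementation.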

Step 4: Set `u = 1/sqrt(n)` and define the coefficients `a_1, ..., a_n` such that:

`a_n = -2.706056u^5 + 4.434685u^4 - 2.071190u^3 - 0.147981u^2 + 0.221157u + m_n m^(-0.5)`

`a_(n-1) = -3.582633u^5 + 5.682633u^4 - 1.752461u^3 - 0.293762u^2 + 0.042981u + m_(n-1) m^(-0.5)`

and the remaining:

`a_i = m_i / sqrt(epsilon)` for `2 < i < n-1`

`a_2 = -a_(n-1)` and `a_1 = -a_n`

where

`epsilon = (m - 2m_n^2 - 2m_(n-1)^2)/(1 - 2a_n^2 - 2a_(n-1)^2)`
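The coefficient construction in Step 4 is easier to follow when written out in order. Below is a hedged Python sketch of it. The function name `sw_coefficients` and the guard on small samples are my own assumptions, and `NormalDist().inv_cdf` again stands in for `NORM.S.INV`.

```python
from math import sqrt
from statistics import NormalDist


def sw_coefficients(n):
    """Return the coefficients a_1..a_n from Step 4 for sample size n."""
    if n < 6:
        # Assumption of this sketch: very small samples leave too few
        # middle terms for the epsilon formula to be meaningful.
        raise ValueError("this sketch assumes at least 6 observations")
    inv_cdf = NormalDist().inv_cdf
    m = [inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    m_sumsq = sum(v * v for v in m)
    u = 1 / sqrt(n)
    # The two largest coefficients come from the polynomial approximations.
    a_n = (-2.706056 * u**5 + 4.434685 * u**4 - 2.071190 * u**3
           - 0.147981 * u**2 + 0.221157 * u + m[-1] / sqrt(m_sumsq))
    a_n1 = (-3.582633 * u**5 + 5.682633 * u**4 - 1.752461 * u**3
            - 0.293762 * u**2 + 0.042981 * u + m[-2] / sqrt(m_sumsq))
    eps = ((m_sumsq - 2 * m[-1]**2 - 2 * m[-2]**2)
           / (1 - 2 * a_n**2 - 2 * a_n1**2))
    # Middle coefficients a_3..a_(n-2) are rescaled normal scores.
    middle = [v / sqrt(eps) for v in m[2:n - 2]]
    return [-a_n, -a_n1] + middle + [a_n1, a_n]


a = sw_coefficients(20)
print([round(v, 4) for v in a])
print(round(sum(v * v for v in a), 6))  # sum of squares of the coefficients
```

The role of `epsilon` is to rescale the middle coefficients so that all the coefficients together have a sum of squares of 1. The coefficients are also antisymmetric, `a_i = -a_(n+1-i)`, so they sum to zero.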

Step 5: The null hypothesis of the Shapiro-Wilk test is that the data `x_1, x_2, x_3, ..., x_n` came from a normally distributed population. The test statistic, where `x_((i))` is the i-th smallest value, is defined as:

`W = (sum_(i=1)^n a_i x_((i)))^2 / (sum_(i=1)^n (x_i - bar x)^2 )`
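To make the formula for `W` concrete, here is a small Python sketch. The function name `shapiro_wilk_w` is hypothetical, and the three coefficients in the example are toy values chosen only because they are antisymmetric with a sum of squares of 1. They are not produced by Step 4.

```python
from math import sqrt


def shapiro_wilk_w(data, coefficients):
    """Compute W from raw data and a matching list of coefficients a_i."""
    x = sorted(data)  # x_(1) <= x_(2) <= ... <= x_(n)
    if len(x) != len(coefficients):
        raise ValueError("data and coefficients must have the same length")
    mean = sum(x) / len(x)
    denominator = sum((v - mean) ** 2 for v in x)
    if denominator == 0:
        raise ValueError("W is undefined when all values are identical")
    numerator = sum(a * v for a, v in zip(coefficients, x)) ** 2
    return numerator / denominator


# Toy coefficients for illustration only (assumption, not from Step 4).
toy_a = [-1 / sqrt(2), 0.0, 1 / sqrt(2)]
print(shapiro_wilk_w([2, 1, 3], toy_a))  # evenly spaced data, W close to 1
print(shapiro_wilk_w([1, 2, 4], toy_a))  # uneven spacing, W below 1
```

Because the coefficients sum to zero and have a sum of squares of 1, `W` equals the squared correlation between the sorted data and the coefficients. That is why an expression like `CORREL(sorted_data, ai)^2` gives the same result as the ratio above.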

Empirically, for values of n between 12 and 5,000, the statistic `ln(1-W)` is approximately normally distributed with the following mean and standard deviation:

`mu = 0.0038915(ln n)^3 - 0.083751(ln n)^2 - 0.31082(ln n) -1.5861` and

`sigma = e^(0.0030302(ln n)^2 - 0.082676(ln n) - 0.4803)`

Step 6: Finally, we calculate the z-statistic

`z = (ln(1-W) - mu) / sigma`

and obtain the p-value from the upper tail of the standard normal distribution, `p = 1 - Phi(z)`. If the p-value `<= alpha`, we reject the null hypothesis that the data is normally distributed.

Congratulations if you made it this far! The method is lengthy to explain, but the implementation is straightforward in comparison to the Anderson-Darling test.
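The last two steps turn `W` and the sample size into a p-value. The sketch below shows one way to write that in Python, assuming the sample size is inside the 12 to 5,000 range quoted above. The function name `sw_p_value` is my own, and the constants are those of the normal approximation to `ln(1-W)`.

```python
from math import exp, log
from statistics import NormalDist


def sw_p_value(w, n):
    """Return (z, p) for a Shapiro-Wilk statistic w and sample size n."""
    if not 12 <= n <= 5000:
        raise ValueError("the approximation is stated for 12 <= n <= 5000")
    ln_n = log(n)
    mu = 0.0038915 * ln_n**3 - 0.083751 * ln_n**2 - 0.31082 * ln_n - 1.5861
    sigma = exp(0.0030302 * ln_n**2 - 0.082676 * ln_n - 0.4803)
    z = (log(1 - w) - mu) / sigma
    p = 1 - NormalDist().cdf(z)  # upper tail: large z means W is too small
    return z, p


alpha = 0.05
for w in (0.98, 0.85):
    z, p = sw_p_value(w, 20)
    verdict = "reject normality" if p <= alpha else "fail to reject normality"
    print(f"W={w}: z={z:.3f}, p={p:.4f} -> {verdict}")
```

A `W` close to 1 makes `ln(1-W)` very negative, which gives a small z and a large p-value. A lower `W` pushes z up and the p-value down.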

Implementing Shapiro-Wilk Test in Excel LAMBDA

Below is the Shapiro-Wilk Test implemented in Excel LAMBDA.

dcrNormality.ShapiroWilk.Test
=LAMBDA(array, [show_details],
  LET(show_details, IF(ISOMITTED(show_details), FALSE, show_details),
    size, COUNT(array),
    u, 1 / SQRT(size),
    array_X, FILTER(TOCOL(array), TOCOL(array)<>""),
    array_Y, SORT(array_X),
    index, SEQUENCE(size,1,1,1),
    mi, NORM.S.INV((index-0.375)/(size+0.25)),
    mi_sumsqr, SUMSQ(mi),
    n, size,
    n_minus_1, n-1,
    a_n_minus_1, -3.582633*u^5+5.682633*u^4-1.752461*u^3-0.293762*u^2+0.042981*u+INDEX(mi,n_minus_1)/SQRT(mi_sumsqr),
    a_n, -2.706056*u^5+4.434685*u^4-2.07119*u^3-0.147981*u^2+0.221157*u+INDEX(mi,n)/SQRT(mi_sumsqr),
    a_1, -a_n,
    a_2, -a_n_minus_1,
    m_n, INDEX(mi, n),
    m_n_minus_1, INDEX(mi, n_minus_1),
    eps, (mi_sumsqr-2*m_n^2-2*m_n_minus_1^2)/(1-2*a_n^2-2*a_n_minus_1^2),
    ai, VSTACK(
      a_1, a_2,
      TOCOL(INDEX(mi,SEQUENCE(size-4,1,3,1)))/SQRT(eps),
      a_n_minus_1, a_n
    ),
    W, CORREL(array_Y,ai)^2,
    mean, 0.0038915*LN(size)^3-0.083751*LN(size)^2-0.31082*LN(size)-1.5861,
    stdev, EXP(0.0030302*LN(size)^2-0.082676*LN(size)-0.4803),
    z_statistics, STANDARDIZE(LN(1-W),mean,stdev),
    pvalue, 1-NORM.S.DIST(z_statistics,TRUE),
    stats, VSTACK(
      HSTACK("size", size),
      HSTACK("u", u),
      HSTACK("mi_sumsqr", mi_sumsqr),
      HSTACK("eps", eps),
      HSTACK("W", W),
      HSTACK("mean", mean),
      HSTACK("stdev", stdev),
      HSTACK("z-statistics", z_statistics),
      HSTACK("p-value", pvalue)
    ),
    header, {"Index", "array_X", "array_Y", "mi", "ai", "Stats"},
    details, VSTACK(
      header,
      HSTACK(
        index,
        array_X,
        array_Y,
        mi,
        ai,
        stats
      )
    ),
    IF(show_details, IFNA(details,""), pvalue)
  )
)

Seeing Shapiro-Wilk Test in Action

Using this LAMBDA is easy. Just pass in the data, and indicate if you want details to be shown.

The LAMBDA returns a p-value of 0.9216, which is greater than 0.05, so we fail to reject the null hypothesis. There is no evidence that this set of data departs from a normal distribution.

Reminder: Anderson-Darling and Shapiro-Wilk tests on the same set of data will return different p-values. It does not mean one is better than the other, nor does it imply any contradiction.

So now you have both the Anderson-Darling and Shapiro-Wilk normality tests in LAMBDA. Whether you use one or both, you should test whether your data is normally distributed before deciding if you can use a mean comparison test.

But what happens if your data is not normally distributed? You will need a non-parametric comparison test, like Mann-Whitney, which I will show next. In the meantime, test if your data is normal in DC DEN!

