Kruskal-Wallis Test - Comparing multiple non-parametric data sets

Previously I showed one-way ANOVA for comparing samples means. It assumes the samples are normally distributed. For non-parametric samples, we have Kruskal-Wallis test. Kruskal-Wallis the alternative to one-way ANOVA when your samples do not follow the normal distribution.

Distribution 2 samples > 2 samples
Normal Two Sample Mean Test ANOVA
Non-parametric Mann-Whitney Kruskal-Wallis

How does Kruskal-Wallis test work?

Kruskal-Wallis test is used to test if there is any statistically significant differences between the medians of three or more independent (unrelated) groups. It is similar to Mann-Whitney test in that it sorts the data altogether, ranks them and then calculates the test statistics. But where Mann-Whitney uses NORM.S.DIST to calculate the test statistics, Kruskal-Wallis uses CHISQ.DIST instead.

Implementing Kruskal Wallis in LAMBDA

Step 1: Group data together

We can group the data together using TOCOL. But we need to filter out any empty records, especially when the samples do not have the same number of records.

    array_column, FILTER(TOCOL(array,3), TOCOL(array,3)<>""),

Step 2: Sort the data

I sort the data and also the reverse sort.

    array_sorted,  SORT(array_column),
    array_rsorted, SORT(array_column, ,-1),

Together will be used the first and last occurrence of any repeated values.

    array_first, MATCH(array_sorted, array_sorted, 0),
    array_last,  array_size - MATCH(array_sorted, array_rsorted, 0) + 1,

Step 3: Rank the data

We then rank the data using these array_first and array_last.

    array_rank, (array_first + array_last)/2,
    array_ranked, IFNA(INDEX(array_rank, MATCH(array, array_sorted, 0)),""),
    col_ranked_sum,  BYCOL(array_ranked,LAMBDA(pcol,SUM(pcol))),
    col_ranked_size, BYCOL(array_ranked,LAMBDA(pcol,COUNT(pcol))),
    col_ranked_SS, col_ranked_sum^2 / col_ranked_size,
    array_ranked_SS, SUM(col_ranked_SS),

NOTE: We have to do this because Excel's RANK.AVG does not works on arrays.

Step 4: Calculate the Test Statistics

The final step is to calculate the test statistics. The p-value is derived from CHISQ.DIST using this value.

    h_statistics, 12*array_ranked_SS/(array_size*(array_size+1))-3*(array_size+1),
    df, COUNT(col_ranked_size) - 1,
    pvalue, CHISQ.DIST.RT(h_statistics, df),

Putting it all together

The following is Kruskal Wallis test in LAMBDA.

dcrNonParametric.KruskalWallis.Test
=LAMBDA(array_data_in_columns,[show_details],
  LET(
    show_details, IF(ISOMITTED(show_details), FALSE, show_details),
    array, array_data_in_columns,
    array_size, COUNT(array),
    array_column, FILTER(TOCOL(array,3), TOCOL(array,3)<>""),
    array_sorted,  SORT(array_column),
    array_rsorted, SORT(array_column, ,-1),
    array_first, MATCH(array_sorted, array_sorted, 0),
    array_last,  array_size - MATCH(array_sorted, array_rsorted, 0) + 1,
    array_rank, (array_first + array_last)/2,
    array_ranked, IFNA(INDEX(array_rank, MATCH(array, array_sorted, 0)),""),
    col_ranked_sum,  BYCOL(array_ranked,LAMBDA(pcol,SUM(pcol))),
    col_ranked_size, BYCOL(array_ranked,LAMBDA(pcol,COUNT(pcol))),
    col_ranked_SS, col_ranked_sum^2 / col_ranked_size,
    array_ranked_SS, SUM(col_ranked_SS),
    h_statistics, 12*array_ranked_SS/(array_size*(array_size+1))-3*(array_size+1),
    df, COUNT(col_ranked_size) - 1,
    pvalue, CHISQ.DIST.RT(h_statistics, df),
    details_grp, HSTACK(
      VSTACK("Kruskal Wallis Test", "Rank Sum", "Size", "Sum of Squares"),
      VSTACK(
        HSTACK("Group " & SEQUENCE(1,COLUMNS(array_ranked))),
        col_ranked_sum,
        col_ranked_size,
        col_ranked_SS
      )
    ),
    details_total, HSTACK(
      VSTACK("Total Size", "Total SS", "h_statistics", "df", "p-value"),
      VSTACK(array_size, array_ranked_SS, h_statistics, df, pvalue)
    ),    
    details, VSTACK(details_grp, details_total),
    IF(show_details, IFNA(details,""), pvalue)
  )
)

Seeing Kruskal Wallis test in action

Let's see this function in action. Suppose we have 3 sample data in columns A, B, and C. Notice that they are of different lengths.

Enter the required parameters and press ENTER.

The function returns the test statistics and p-value. In this instance the p-value of 0.01... < 5%. We can conclude that the 3 samples are statistically different from each other.

Conclusion

Kruskal Wallis test is a ranking test, an alternative to ANOVA, to compare samples when they are not normally distributed. Use Kruskal Wallis if you know your data is not normally distributed.

You are now fully equipped with various statistical tools to test your data in DC-DEN!

Comments