Posts

Showing posts with the label DC-Stats

Out-of-the-Box Part 2 - One Sample Mean Testing with Z.TEST

Image
Previously we saw the One Sample Mean test implemented using LAMBDA  by calculating the test statistics. Excel provides something similar using the  Z.TEST  function. The Z.TEST function compares the average of a sample if it is statistically less than or equal to a test value. The formula is stated as: Z.TEST(array, x, [sigma]) where array is the range of data to be tested against x x is the value of test. sigma is optional. If the population standard deviation is known, put it here. Otherwise leave it empty, and Excel will use the sample standard deviation instead. In effect you are testing to see if the average of the array is less than or equal to x `bar(array) < x` Comparing for Less Than or Equal To Observe the box plot of 5 sample distributions above. We want to check if the distributions averages are less than or equal to the value of 50 . We see samples A and B less than 50, sample C around 50, and samples D and E more than 50. NOTE: Where possible do box plots of the

Out-of-the-Box Part 1 - Proportion Testing with CHISQ.TEST

Image
So Excel Lambda is great! You can create custom functions that are not included in Excel. And we have seen in this blog how you could implement various statistical hypothesis testing .  But then what are those Excel TEST functions for? What hypothesis testing can I do with these Out-of-the-box TEST functions? In the next few posts, I will describe some common hypothesis testing you could do: Proportion Test with  CHISQ.TEST One Sample Mean Test with  Z.TEST Two Sample Mean Test with  T.TEST Variance Test with  F.TEST Excel's Chi Squared Test for One Proportion Testing In Excel's documentation, CHISQ.TEST is described as a test for independence . It does this by doing a goodness-to-fit test on how well the data matches the expected. This means we can also use it for one proportion testing. To use CHISQ.TEST we will need to compare the sample proportion against the expected proportion. But unlike DC-Den's One Proportion Testing  where we only need to specify the expected

Correlation Coefficient Matrix

Image
Recently I wanted to find the correlation coefficient between multiple groups of data. A correlation coefficient shows the relationship between two groups of data . A positive correlation means as one variable increase the other increases, while a negative correlation means as one variable increases the other variable decreases. The correlation of variable with itself is always 1. A zero correlation indicates no relationship. Correlation coefficient is always between -1 and 1. Positive, Zero and Negative Correlation Excel provides the CORREL  function for this. It is fairly easy to use as shown by its syntax: CORREL(array1, array2) A correlation matrix is a table showing the correlation between pairs of variables . You could put the coefficients in a table like the one shown below. Implementing a Correlation Matrix function I thought of using LAMBDA instead. Such correlation matrix would be n x n table, where n is the number of data groups; number of columns from the d

Which Test To Use?

Image
You conducted your survey and collected your data. You plotted your charts and suspect a certain conclusion. Which test should you use to validate your hypothesis? Having learnt different statistical tools and showing their implementation in Excel LAMBDA, we want a reference to guide us. Qualitative and Quantitative Data The first is to distinguish the type of data. Data can be qualitative or quantitative . Qualitative data can be nominal or ordinal. Nominal data are named categorised data, for example gender:{male, female} and direction:{north, south, east, west}. Ordinal data are categories with implied order, for example size:{small, medium, large}, school grades:{A, B, C, D, E} and service rating:{very poor, poor, average, good, great}. The order however are not equally spaced, i.e. you cannot equate the difference between poor and average, with the difference between average and good. Quantitative data can be interval or ratio. Unlike qualitative data, quantitative data are equal

Kruskal-Wallis Test - Comparing multiple non-parametric data sets

Image
Previously I showed one-way  ANOVA  for comparing samples means. It assumes the samples are normally distributed. For non-parametric samples, we have Kruskal-Wallis test. Kruskal-Wallis the alternative to one-way ANOVA when your samples do not follow the normal distribution. Distribution 2 samples > 2 samples Normal Two Sample Mean Test ANOVA Non-parametric Mann-Whitney Kruskal-Wallis How does Kruskal-Wallis test work? Kruskal-Wallis test is used to test if there is any statistically significant differences between the medians of three or more independent (unrelated) groups. It is similar to Mann-Whitney test in that it sorts the data altogether, ranks them and then calculates the test statistics. But where Mann-Whitney uses NORM.S.DIST to calculate the test statistics, Kruskal-Wallis uses CHISQ.DIST instead. Implementing Kruskal Wallis in LAMBDA Step 1 : Group data together We can group the data together using

ANOVA - Comparing averages of multiple data sets

Image
In an earlier post, we used the Two Sample Mean Test to compare the mean of two samples (duh). If you have 3 samples you could use the same test, and compare between pairs of samples, i.e. sample 1 & sample 2, sample 2 & sample 3, and sample 1 & sample 3. The problem arises when you have many more sample sets. Running this test means keeping track of the permutations. This approach is wastefully repetitive and you are likely to make mistakes. Also, sometimes, you only want to know if there is at least one sample that is different from the rest. Scientific or medical break throughs are rare. In most experiments and trials, there will be no change/difference. A single test that tells you that there is no difference between the samples allows you to move on.  Is there a better method to compare the mean of multiple samples? Enter ANOVA Analysis Of Variance (ANOVA) is a statistical method used to compare the means between samples. It tells you if there's any significant d

When it is not Normal... The Mann-Whitney Test

Image
The Mann-Whitney Test Mann-Whitney test helps you compare two sets of data when they are not normally distributed. I would use Mann-Whitney test only after I confirm using the  Anderson-Darling test . I should add that I run a box-plot first before running the AD test. I used to think that Mann-Whitney compares the medians of two data sets. But in the process of implementing Mann-Whitney in Excel LAMBDA, I found out that Mann-Whitney test compares the mean ranks, in doing so, it determines if the two data set came from the same population. Step 1: Group the two samples together and sort the data in ascending order, but retain their sample origin. Step 2: Rank the combined data. Step 3: Separate the data back into the two samples. Sum the ranks for each sample. Step 4: Compute the test statistics U `U_1 = R_1 - (n_1(n_1+1))/2` and  `U_2 = R_2 - (n_2(n_2+1))/2` Step 5: Choose the small U to calculate the equivalent z-statistics. `z = (U - m_U) / sigma_U` where `m_U` and `sigma_U` are the

Is Your Data Normal? The Shapiro-Wilk Test

Image
  Shapiro-Wilk and Anderson Darling Tests The Shapiro-Wilk test is another statistical test to assess the normality of a dataset. Just as Anderson-Darling test , both are goodness-of-fit test. Why would we need another normality test? Let me put some points forward: The Shapiro-Wilk SW test is specifically a test for normality. The Anderson-Darling AD test can be applied to other distributions, not just normal distribution. That said, my AD implementation  was specific for normal distribution. AD test is more sensitive to deviations in the tails of the distribution, while SW test is sensitive to deviations across the entire distribution. Both AD and SW can be used to calculate p-values, but the p-value will not be comparable. This means there can be instances where AD and SW disagree . From implementation perspective, I find AD test easier to write than the SW test (a lot more steps), but AD test was much more difficult to understand compared to SW test. So, which test should you use:

Is Your Data Normal? The Anderson-Darling Test

Image
Have a look at the Excel Histogram chart above based on a given set of data. Does it look normally distributed? How could you find out? Well, you could calculate the kurtosis and skewness, e.g. using  Descriptive Statistics , or generate a BoxPlot. But still, how can you know with certainty? Descriptive Statistics showing Kurtosis and Skewness BoxPlot of Sample Data Why is Checking for Normality Important? In previous blogs on One Sample Mean Test and Two Sample Mean Test , I assumed the given data is normally distributed. If the sample is normally distributed then choosing a parametric test like t-test is applicable since we are using the mean (central tendency) and standard deviation (spread) to calculate the t-statistics . Assuming normal distribution is a simplification. It is therefore only to justify using parametric tests. However, mean and standard deviation should not be used to describe a sample that is not normally distr

Extending Two Sample Mean Test for Arrays

Image
In the previous blog, I implemented a Two Sample Mean hypothesis test with Excel LAMBDA. In this blog, I will create a new function that takes in two sample data arrays directly, calculate the necessary statistics, and then reuse the Two Sample Mean function to perform the hypothesis test. This approach will ensure the same result. Parameter Changes In the earlier implementation, the dcrMean.Two.TTest input parameters are: dcrMean.Two.TTest =LAMBDA(sample_mean_1, sample_stdev_1, sample_size_1, sample_mean_2, sample_stdev_2, sample_size_2, [tail], [show_details], To take in two arrays, I will create a new function dcrMean.Two.TTest.Array with input parameters as follows: dcrMean.Two.TTest.Array =LAMBDA(array_1, array_2, [tail], [show_details], Within this new function, the mean and standard deviation of the two input arrays will be pre-calculated before passing the values into the hypothesis test. The implementation will be like this: =LAMBDA(array_1, array_2, [tail], [show_

Creating Two Sample Mean Test with LAMBDA

Image
You received glass window samples from two different vendors. You wish to check if the thickness of the samples are the same. Based on the samples, you obtain the following data: Vendor Sample Sample 1 Sample 2 Mean (cm) 2.0 2.3 Standard Deviation (cm) 0.2 0.3 Number of Samples 15 20 At first glance sample 2 is 0.3cm thicker than sample 1. But are their thickness significantly different? To make a comparison you would need a perform a Two Sample Mean test . Two Sample Mean Test A two sample mean test compares two sample distribution means against each other. It differs from the one sample mean test that compares against a target value. We write the Null Hypothesis as the means of sample 1 and of sample 2 are equal. `H_0: mu_1 = mu_2` And the Alternative Hypothesis as the mean of sample 1 and of sample 2 are not equal. `H_1: mu_1 != mu_2` We could also test

Extending One Sample Mean Test for Arrays

Image
In the previous blog, I implemented a One Sample Mean hypothesis test in Excel LAMBDA. When given an array of sample data, we firstly calculate the sample mean, standard deviation and size, before passing these into the function. The function compares a sample mean against an expected population mean. In this blog, we will create a new function to take the sample data array directly, calculate the necessary statistics, and then reuse the One Sample Mean function to perform the hypothesis test. This extension ensures the same result regardless of the approach used. Parameter Changes In the earlier One Sample Mean dcrMean.One.TTest the input parameters were defined as: dcrMean.One.TTest =LAMBDA(expected_mean, sample_mean, sample_stdev, sample_size, [tail], [show_details], To implement One Sample Mean test for array,  dcrMean.One.TTest.Array  input parameter will take the entire data array: dcrMean.One.TTest.Array =LAMBDA(expected_mean, array, [tail], [show_details], Within this

Creating One Sample Mean Test with LAMBDA

Image
Suppose you visited a village. The village elders claims to possess a secret elixir that makes their children grower taller! You are aware of the national 12-year old  average height is 150cm. And you could measure the village's 12-year old children's height. Plotting the village's children's height gives you a distribution like in the graph above. At first glance the village's children does seem taller than the national average. Is the village's children really taller? Is the distribution significantly different? One Sample Mean Test A one sample mean test compares a sample distribution's mean against a target value. It is similar to the one proportion test but the sample data is continuous, giving a normal distribution curve instead of a yes/no binary result. We write the Null Hypothesis as the sample mean equal to the population mean. `H_0: mu = mu_0` And the Alternative Hypothesis as the sample mean not equal to the population mean. `H_1: mu != mu_0` We