Posts

What is Linear Regression?

Image
Linear Regression is the modelling of the relationship of a dependent variable to one or more independent variables. You want to predict the value of a dependent variable given other independent variable values. The aim of linear regression is to find the line with the best fit for the given data. Simple and Multiple Linear Regression Simple linear regression is the case of only one independent variable. The equation is written as: `y = b + mx` Multiple linear regression has many independent variables and the equation is written as: `y = b + m_1x_1 + m_2x_2 + m_3x_3 + ...` where: `y` is the dependent variable `x_i` are the independent variables `m_i` are the coefficient for the corresponding `x_i` variables `b` is a constant, sometimes known as error or offset `b` also has a special property. It is the y-intersect when `x_i = 0` for all `i`. For simple linear regression `m` and `b` are calculated using the formula: `m = sum_((x - barx)(y - bary))/sum_((x - barx)^2)` `b = bary - m barx`

Supporting Functions for DC-ML

Image
I will be using some tools to support my data mining functions. I will put them here for your reference. SelectData This function filters a set of data by rows. The default is every 4 in 5 is selected as training data. Every 5th row is for validation data. dcrML.Help.SelectData =LAMBDA(array, selectTrain, [headers], [ratioTrain], [ratioValidate], LET( ratioTrain, IF(ISOMITTED(ratioTrain), 4, ratioTrain), ratioValidate, IF(ISOMITTED(ratioValidate), 1, ratioValidate), selectTrain, IF(ISOMITTED(selectTrain), TRUE, selectTrain), ratioTotal, ratioTrain + ratioValidate, selected, IF(selectTrain, FILTER(array, MOD(ROW(array),ratioTotal) < ratioTrain), FILTER(array, MOD(ROW(array),ratioTotal) >= ratioTrain) ), IF(ISOMITTED(headers), selected, VSTACK(headers, selected) ) ) ) GetHeaders This function is overloaded. If dataHeaders are provided, it returns them. However if none provided, it returns a sequential headers: "F

Data Mining or Machine Learning

Image
I covered a number of statistical tests using Excel LAMBDA. The reason for using Excel LAMBDA was its ubiquity and undemanding learning curve . While there are more statistical inferences test, I only covered those that I commonly used. If however you think other common ones, please let me know. I would be interested as well. Data Mining or Machine Learning When I started data analysis, the term data mining made sense. The techniques used within Data Mining is with the intention of identifying patterns within a data set. The problem came when I started searching more of a topic from data mining, they keep popping up in  machine learning . Machine Learning  is the process of computers learning in a way that mimics human learning or through algorithms. To accomplish this machine learning use data mining techniques as the process requires identification of patterns. While there is a difference between data mining and machine learning, do not be surprise of the overlap or if you start wond

Correlation Coefficient Matrix

Image
Recently I wanted to find the correlation coefficient between multiple groups of data. A correlation coefficient shows the relationship between two groups of data . A positive correlation means as one variable increase the other increases, while a negative correlation means as one variable increases the other variable decreases. The correlation of variable with itself is always 1. A zero correlation indicates no relationship. Correlation coefficient is always between -1 and 1. Positive, Zero and Negative Correlation Excel provides the CORREL  function for this. It is fairly easy to use as shown by its syntax: CORREL(array1, array2) A correlation matrix is a table showing the correlation between pairs of variables . You could put the coefficients in a table like the one shown below. Implementing a Correlation Matrix function I thought of using LAMBDA instead. Such correlation matrix would be n x n table, where n is the number of data groups; number of columns from the d

Camel Case, Snake Case and Other Notations

Image
Camel Case   Snake Case There are a few syntax styles among programmers. I mentioned snake and camel case previously. Here are a few of the different notations: snake_case_notation cameCaseNotation PascalCaseNotation nocasenotation kebab-case-notation Excel LAMBDA is not case sensitive. So while you don't have to use these notations, without them code reading becomes difficult. I also recommend keeping Excel native functions in upper case. e.g. SUM, AVERAGE, COUNT . This to distinguish Excel functions from your own functions and variables.  Snake Case Notation I used snake case notation in Excel LAMBDA because LAMBDA didn't have an IDE (yet!) and I was writing the functions in a plain text editor. Snake case notation is easy to read as the individual words are spaced out. grp_sum, BYCOL(array, LAMBDA(p_column, SUM(p_column))), my_var, long_function_name(array), Camel Case Notation I switched to camel case notation when I started implementing data m

Which Test To Use?

Image
You conducted your survey and collected your data. You plotted your charts and suspect a certain conclusion. Which test should you use to validate your hypothesis? Having learnt different statistical tools and showing their implementation in Excel LAMBDA, we want a reference to guide us. Qualitative and Quantitative Data The first is to distinguish the type of data. Data can be qualitative or quantitative . Qualitative data can be nominal or ordinal. Nominal data are named categorised data, for example gender:{male, female} and direction:{north, south, east, west}. Ordinal data are categories with implied order, for example size:{small, medium, large}, school grades:{A, B, C, D, E} and service rating:{very poor, poor, average, good, great}. The order however are not equally spaced, i.e. you cannot equate the difference between poor and average, with the difference between average and good. Quantitative data can be interval or ratio. Unlike qualitative data, quantitative data are equal

Kruskal-Wallis Test - Comparing multiple non-parametric data sets

Image
Previously I showed one-way  ANOVA  for comparing samples means. It assumes the samples are normally distributed. For non-parametric samples, we have Kruskal-Wallis test. Kruskal-Wallis the alternative to one-way ANOVA when your samples do not follow the normal distribution. Distribution 2 samples > 2 samples Normal Two Sample Mean Test ANOVA Non-parametric Mann-Whitney Kruskal-Wallis How does Kruskal-Wallis test work? Kruskal-Wallis test is used to test if there is any statistically significant differences between the medians of three or more independent (unrelated) groups. It is similar to Mann-Whitney test in that it sorts the data altogether, ranks them and then calculates the test statistics. But where Mann-Whitney uses NORM.S.DIST to calculate the test statistics, Kruskal-Wallis uses CHISQ.DIST instead. Implementing Kruskal Wallis in LAMBDA Step 1 : Group data together We can group the data together using