Posts

Happy New Year 2024!

Image
Happy New Year🎈🎉! When DC-DEN blog started in 2021, I really wanted to write on data centre stuff. But I realised a number of things I would show/share is either confidential or proprietary. So in 2023, I decided to extend the blog about Excel LAMBDA in the area of statistics and data mining. I hope you are taking a good break and that you will come back next year for more from  DC-DEN !

Recursion: LAMBDA calling LAMBDA

Image
Excel Lambda functions can be used to create custom, reusable functions by giving them a friendly name. This means a custom Lambda function could call itself. A recursive function is a function that calls itself and self-moves towards an exit condition. Without the exit condition, the function loops continuously unless externally terminated. Why use recursion? Some problems can be expressed in an elegant recursive structure. For example the n-th factorial is defined as: `n! = n (n-1) (n-2) ... 1` with the termination condition: `0! = 1` Recursion example using Lambda Before we start, Excel provides the FACT function to calculate the factorial of a number. You can use this to compare the results of your own. This is an example of a recursive function MyFactorial =LAMBDA(n, IF(n = 0, 1, n * MyFactorial(n-1) ) ) The IF function is used to determine the termination condition. If the termination condition has not been reached, decrement the variable by 1 and call the functi

Linear Regression: Why you should reinvent Excel's LINEST?

Image
In the previous article on  Linear Regression , I mentioned Excel's LINEST function. But if you tried using the returned coefficients, you may notice something peculiar. The order of the returned linear coefficients is in the reverse order of the input data. LINEST documents: The equation for the line is: `y = m_1x_1 + m_2x_2 + ... + m_nx_n+ b` if there are multiple ranges of x-values, where the dependent y-values are a function of the independent x-values. The m-values are coefficients corresponding to each x-value, and b is a constant value. Note that y, x, and m can be vectors. The array that the LINEST function returns is `{m_n, m_(n-1), ..., m_1, b}`.  The input is in the order 1st, 2nd, 3rd, ... but the returned coefficients are in the reverse. And if you were to use the coefficients to predict y for a given `x_1, x_2, x_3, ...`, you would either swap the x-s around or the coefficients around. This isn't intuitive. For this reason you should reinvent LINEST . The inten

What is Linear Regression?

Image
Linear Regression is the modelling of the relationship of a dependent variable to one or more independent variables. You want to predict the value of a dependent variable given other independent variable values. The aim of linear regression is to find the line with the best fit for the given data. Simple and Multiple Linear Regression Simple linear regression is the case of only one independent variable. The equation is written as: `y = b + mx` Multiple linear regression has many independent variables and the equation is written as: `y = b + m_1x_1 + m_2x_2 + m_3x_3 + ...` where: `y` is the dependent variable `x_i` are the independent variables `m_i` are the coefficient for the corresponding `x_i` variables `b` is a constant, sometimes known as error or offset `b` also has a special property. It is the y-intersect when `x_i = 0` for all `i`. For simple linear regression `m` and `b` are calculated using the formula: `m = sum_((x - barx)(y - bary))/sum_((x - barx)^2)` `b = bary - m barx`

Supporting Functions for DC-ML

Image
I will be using some tools to support my data mining functions. I will put them here for your reference. SelectData This function filters a set of data by rows. The default is every 4 in 5 is selected as training data. Every 5th row is for validation data. dcrML.Help.SelectData =LAMBDA(array, selectTrain, [headers], [ratioTrain], [ratioValidate], LET( ratioTrain, IF(ISOMITTED(ratioTrain), 4, ratioTrain), ratioValidate, IF(ISOMITTED(ratioValidate), 1, ratioValidate), selectTrain, IF(ISOMITTED(selectTrain), TRUE, selectTrain), ratioTotal, ratioTrain + ratioValidate, selected, IF(selectTrain, FILTER(array, MOD(ROW(array),ratioTotal) < ratioTrain), FILTER(array, MOD(ROW(array),ratioTotal) >= ratioTrain) ), IF(ISOMITTED(headers), selected, VSTACK(headers, selected) ) ) ) GetHeaders This function is overloaded. If dataHeaders are provided, it returns them. However if none provided, it returns a sequential headers: "F

Data Mining or Machine Learning

Image
I covered a number of statistical tests using Excel LAMBDA. The reason for using Excel LAMBDA was its ubiquity and undemanding learning curve . While there are more statistical inferences test, I only covered those that I commonly used. If however you think other common ones, please let me know. I would be interested as well. Data Mining or Machine Learning When I started data analysis, the term data mining made sense. The techniques used within Data Mining is with the intention of identifying patterns within a data set. The problem came when I started searching more of a topic from data mining, they keep popping up in  machine learning . Machine Learning  is the process of computers learning in a way that mimics human learning or through algorithms. To accomplish this machine learning use data mining techniques as the process requires identification of patterns. While there is a difference between data mining and machine learning, do not be surprise of the overlap or if you start wond

Correlation Coefficient Matrix

Image
Recently I wanted to find the correlation coefficient between multiple groups of data. A correlation coefficient shows the relationship between two groups of data . A positive correlation means as one variable increase the other increases, while a negative correlation means as one variable increases the other variable decreases. The correlation of variable with itself is always 1. A zero correlation indicates no relationship. Correlation coefficient is always between -1 and 1. Positive, Zero and Negative Correlation Excel provides the CORREL  function for this. It is fairly easy to use as shown by its syntax: CORREL(array1, array2) A correlation matrix is a table showing the correlation between pairs of variables . You could put the coefficients in a table like the one shown below. Implementing a Correlation Matrix function I thought of using LAMBDA instead. Such correlation matrix would be n x n table, where n is the number of data groups; number of columns from the d