Linear Regression: Why you should reinvent Excel's LINEST

In the previous article on Linear Regression, I mentioned Excel's LINEST function. But if you tried using the returned coefficients, you may have noticed something peculiar.

The linear coefficients are returned in the reverse order of the input data.

The LINEST documentation states:

The equation for the line is:

`y = m_1x_1 + m_2x_2 + ... + m_nx_n + b`

if there are multiple ranges of x-values, where the dependent y-values are a function of the independent x-values. The m-values are coefficients corresponding to each x-value, and b is a constant value. Note that y, x, and m can be vectors. The array that the LINEST function returns is `{m_n, m_(n-1), ..., m_1, b}`. 

The input is in the order 1st, 2nd, 3rd, ..., but the returned coefficients are in reverse. So if you were to use the coefficients to predict y for a given `x_1, x_2, x_3, ...`, you would have to swap either the x values or the coefficients around. This isn't intuitive.
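To make this concrete, here is a minimal sketch (the cell ranges are placeholders, purely for illustration): with y values in C2:C10 and two x columns in A2:B10, LINEST spills `{m_2, m_1, b}`, so predicting y for a new row in A12:B12 forces you to pick the coefficients out in reverse, something like

=LET(c, LINEST(C2:C10, A2:B10),
  SUMPRODUCT(A12:B12, INDEX(c, 1, {2,1})) + INDEX(c, 1, 3)
)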

For this reason you should reinvent LINEST. The intent is to keep the order of the returned coefficients consistent with the order of the input variables. And while we are at it, we will designate the constant `b` as the 0th-order coefficient.

To summarise, we return the coefficients as `{b, m_1, m_2, ..., m_(n-1), m_n}`.

We will create two LAMBDA functions: dcrML.Linear.Fit returns the coefficients in the order specified above, while dcrML.Linear.Predict takes these coefficients and predicts the y values. The Fit and Predict verbs mimic Python's scikit-learn library.

Fit

We implement Fit using Excel's LINEST to get the coefficients, and then reverse their order with the help of SEQUENCE and INDEX. GetHeaders is a supporting function I wrote earlier.

dcrML.Linear.Fit
=LAMBDA(knownY, knownXs, [headerY], [headerXs], [showDetails],
  LET(
    linearArray, LINEST(knownY, knownXs, TRUE),
    numCoeff, COUNT(linearArray),
    seq, SEQUENCE(1, numCoeff, numCoeff, -1),
    linearCoeff, INDEX(linearArray, 1, seq),
    headerY, dcrML.Help.GetHeaders(knownY, headerY, "Y"),
    headerXs, dcrML.Help.GetHeaders(knownXs, headerXs),
    IF(showDetails,
      VSTACK(
        HSTACK(headerY, headerXs), linearCoeff
      ),
      linearCoeff
    )
  )
)
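As a quick sanity check of the reordering (the ranges here are placeholders, not from any particular workbook): with y values in C2:C10 and two x columns in A2:B10, calling

=dcrML.Linear.Fit(C2:C10, A2:B10)

should spill `{b, m_1, m_2}`, i.e. the constant first, then the coefficients in the same left-to-right order as the input columns. Passing TRUE for showDetails stacks the header row produced by GetHeaders on top of the coefficients.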

Predict

We implement Predict assuming the coefficients are in the order given by Fit. Since they are ordered "nicely", we can use BYROW to multiply the input data by the coefficients directly.

dcrML.Linear.Predict
=LAMBDA(coeffs, newX, [headerY],
  LET(
    const, CHOOSECOLS(coeffs, 1),
    weights, DROP(coeffs, 0, 1),
    predictY, BYROW(newX * weights, LAMBDA(pRow, SUM(pRow) + const)),
    IF(ISOMITTED(headerY),
      predictY,
      VSTACK(headerY, predictY)
    )
  )
)
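And a matching sketch for Predict (again with placeholder ranges): if the coefficients spilled by Fit sit in F3:H3 and the new x rows are in A12:B14, then

=dcrML.Linear.Predict(F3:H3, A12:B14)

spills one predicted y per row of newX; the optional headerY simply stacks a label on top of that column.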

Let's see it in action

For most of my machine learning examples, I will be using the Iris flower data set, which is easily obtainable.

To find the relationship between the Petal Width and the Sepal Length, Sepal Width and Petal Length, we simply call dcrML.Linear.Fit, passing the parameters as shown in the screenshot below.
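For illustration only (the layout is an assumption, yours may differ): if Sepal Length, Sepal Width and Petal Length are in A2:C151 and Petal Width is in D2:D151, the call would look something like

=dcrML.Linear.Fit(D2:D151, A2:C151)

with the optional header and showDetails arguments filled in as needed.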

We can compare against Excel's LINEST, though we must keep in mind that its coefficients are in the reverse order.


We can also compare with Excel's TREND by passing the coefficients in F3:I3 to dcrML.Linear.Predict.
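In formula terms, the comparison might look like this (using the same assumed Iris layout as above, with the fitted coefficients in F3:I3):

=TREND(D2:D151, A2:C151)
=dcrML.Linear.Predict(F3:I3, A2:C151)

Both should spill the same column of predicted Petal Width values.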


This gives us confidence that the formulas are correct and, best of all, the order of the coefficients has been fixed.

So, do have a go using Linear.Fit and Linear.Predict and tell me what you think in DC-DEN!

