Shuffling your data

🂡🂢🂣🂤🂥

In DC-ML's Supporting Functions, I created an Excel LAMBDA function, SelectData, to split data into training and testing datasets. This assumes the data was collected in random order. If it wasn't, shuffling before splitting can increase confidence in the results.

In this blog, I’ll show how to shuffle your dataset.

Why shuffle your data

Splitting data helps guard against overfitting: the test dataset is held back so that model accuracy can be evaluated on unseen data.

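Shuffling data prevents models from learning patterns that come only from the order of the rows, which could introduce bias. For example, if the rows are sorted by date or by class, an unshuffled split could leave the test dataset with only the latest dates or a single class. With the rows randomized, both datasets are more representative, so the test results better reflect how the model performs on unseen data.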

ShuffleData

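The ShuffleData function is simple. It generates a random array (randArray) with one value per dataset row and uses SORTBY to sort the dataset by this array. If the optional headers argument is supplied, the header row is stacked back on top of the shuffled data.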

=LAMBDA(array, [headers],
  LET(
    randArray, RANDARRAY(ROWS(array), 1),
    sortArray, SORTBY(array, randArray),
    IF(ISOMITTED(headers),
      sortArray,
      VSTACK(headers, sortArray)
    )
  )
)
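
As a usage sketch, assume the function is saved in Name Manager as ShuffleData, the headers sit in A1:D1 and the data in A2:D101. These ranges, and the cell F1 used below, are hypothetical. Entered in F1, this formula spills the headers followed by the shuffled rows:

=ShuffleData(A2:D101, A1:D1)

Because RANDARRAY is volatile, the order changes on every recalculation, and two separate ShuffleData formulas will not return the same order. One way to keep a split consistent is to shuffle once and have both datasets reference that spilled range. In the formulas below, the 80% training share is an arbitrary choice, and TAKE and DROP are only stand-ins for whichever split function you use. The first formula returns the training rows and the second the testing rows:

=TAKE(DROP(F1#, 1), ROUND(0.8 * (ROWS(F1#) - 1), 0))

=DROP(DROP(F1#, 1), ROUND(0.8 * (ROWS(F1#) - 1), 0))

To freeze a particular shuffle, copy the spilled range and paste it as values.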

Visualising the effect of the function

Below is a screenshot of ShuffleData.

The table on the left shows the original data, while the one on the right displays the shuffled version. Notice that each row remains intact, but cell formatting (like percentages) may be lost, because the function returns values only.

Remember to shuffle your data before training your model. Stay tuned for more from DC-DEN!
