Shuffling your data
In DC-ML's Supporting Functions, I created an Excel LAMBDA function, SelectData, to split data into training and testing datasets. This assumes the data was collected in random order. If it wasn't, shuffling before splitting can increase confidence in the results.
In this blog, I'll show how to shuffle your dataset.
Why shuffle your data
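Splitting data helps you detect overfitting: the test dataset is used to evaluate model accuracy on data the model has not seen.
Shuffling keeps the order of the rows from influencing that split. If the data is sorted, for example by date or by class, an unshuffled split can leave the training and testing datasets with different distributions, which can bias both the model and its evaluation. Randomizing the row order makes both datasets more representative of the whole, so the test results are a better guide to performance on unseen data.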
ShuffleData
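The ShuffleData function is simple. It generates a column of random numbers (randArray), one per row of the dataset, and uses SORTBY to sort the dataset by that column. If the optional headers argument is supplied, the headers are stacked on top of the shuffled rows.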
=LAMBDA(array, [headers],
    LET(
        randArray, RANDARRAY(ROWS(array), 1),
        sortArray, SORTBY(array, randArray),
        IF(ISOMITTED(headers),
            sortArray,
            VSTACK(headers, sortArray)
        )
    )
)
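As a rough sketch of how this might be used, assume the LAMBDA has been saved in the Name Manager as ShuffleData, and that a ten-row dataset sits in A2:C11 with its headers in A1:C1. These ranges, the output cells (E2, I2, M2) and the 80/20 split are all hypothetical, chosen only for illustration. TAKE and DROP are used here as a simple stand-in for the split step, not as the SelectData function itself.

```excel
E2:  =ShuffleData(A2:C11)
I2:  =TAKE(E2#, 8)
M2:  =DROP(E2#, 8)
```

E2 spills the shuffled rows, I2 takes the first eight of them as a training set, and M2 keeps the remaining two as a test set. Both splits read from the same spilled range (E2#), so they come from the same shuffle. To include the headers in the output, =ShuffleData(A2:C11, A1:C1) can be used instead. In that case the array argument should not contain the header row, otherwise the headers would be shuffled along with the data. Because RANDARRAY is volatile, the order changes every time the worksheet recalculates. To keep one particular shuffle, copy the result and paste it as values.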
Visualising the effect of the function
Below is a screenshot of ShuffleData.
The table on the left shows the original data, while the one on the right displays the shuffled version. Notice that each row remains intact and only the row order changes, but cell formatting (like percentages) may be lost.
Remember to shuffle your data before training your model. Stay tuned for more from DC-DEN!