Learn Data Splitting From Scratch Using R Programming
Data splitting is not a cumbersome task in machine learning, but it is an important step whenever you build a model in R. I still remember building my first model in R, and it was an amazing experience. If you doubt yourself because you have heard that R has complex syntax and is hard to learn, you are heading in the wrong direction: the syntax is approachable for any developer. Data splitting is not hard either, although it can seem a little complicated at first. In this article, we will take a deep dive into how to split your data and prepare it for the models you want to build in the future.
Table of Contents
- Data splitting
- Random sampling
- Stratified sampling
- Class imbalances
1. Data Splitting
A major goal of the machine learning process is to find an algorithm 𝑓(𝑋) that most accurately predicts future values (𝑌̂) based on a set of features (𝑋). In other words, we want an algorithm that not only fits well to our past data but, more importantly, one that predicts a future outcome accurately. This is called the generalizability of our algorithm. How we spend our data will help us understand how well our algorithm generalizes to unseen data. To provide an accurate understanding of the generalizability of our final optimal model, we can split our data into training and test data sets.
- Training set: these data are used to develop feature sets, train our algorithms, tune hyper-parameters, compare models, and carry out all of the other activities required to choose a final model (for example, the model we want to put into production).
- Test set: having chosen a final model, these data are used to obtain an unbiased estimate of the model’s performance, which we refer to as the generalization error.
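To make the two roles concrete, here is a minimal base R sketch. It assumes simulated data and a simple linear model, neither of which comes from this article: the model is fitted on the training set only, and the test set is used once at the end to estimate the generalization error as an RMSE.

```r
# Simulated data: a hypothetical linear relationship with noise (assumption)
set.seed(123)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(n, sd = 2)

# Hold out 30% of the rows as the test set
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit the model on the training set only
fit <- lm(y ~ x, data = train)

# Root mean squared error helper
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Training error versus the estimate of the generalization error
train_rmse <- rmse(train$y, predict(fit, newdata = train))
test_rmse <- rmse(test$y, predict(fit, newdata = test))
print(c(train = train_rmse, test = test_rmse))
```

The test RMSE is the number to report; the training RMSE is shown only for comparison.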
Given a fixed amount of data, typical recommendations for splitting your data into training and test sets include 60% (training) and 40% (testing), 70%–30%, or 80%–20%. Keep the following in mind:
- Spending too much in training (> 80%) won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well but is not generalizable (overfitting).
- Spending too much in testing (> 40%) won’t allow us to get a good assessment of model parameters.
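As a quick illustration of what these proportions mean in rows, the sketch below tabulates the training and test set sizes for the three common splits. The total of 1,000 observations is a hypothetical value chosen for the example.

```r
# Hypothetical number of observations (assumption for illustration)
n <- 1000

# Common training proportions mentioned above
train_props <- c(0.6, 0.7, 0.8)

# Rows allocated to each set for every proportion
split_sizes <- data.frame(
  train_prop = train_props,
  n_train = round(n * train_props),
  n_test = n - round(n * train_props)
)
print(split_sizes)
```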
Other factors should also influence the allocation proportions. For example, very large training sets (for example, 𝑛 > 100K) often result in only marginal gains compared to smaller sample sizes. Consequently, you may use a smaller training sample to increase computation speed (for example, models built on larger training sets often take longer to score new data sets in production). In contrast, as 𝑝 ≥ 𝑛 (where 𝑝 represents the number of features), larger sample sizes are often required to identify consistent signals in the features.
The two most common ways of splitting data include simple random sampling and stratified sampling.
2. Random Sampling
The simplest way to split the data into training and test sets is to take a
simple random sample. This does not control for any data attributes, such as
the distribution of your response variable (𝑌). There are multiple ways to split
our data in R.
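# Load the packages used below
library(caret)    # createDataPartition()
library(rsample)  # initial_split(), training(), testing()
library(h2o)      # h2o.splitFrame()
# Ames housing data (assumption: built with the AmesHousing package)
ames <- AmesHousing::make_ames()
# Using base R
set.seed(123)
index_1 <- sample(seq_len(nrow(ames)), round(nrow(ames) * 0.7))
train_1 <- ames[index_1, ]
test_1 <- ames[-index_1, ]
# Using caret package
# (for a numeric outcome, createDataPartition() samples within quantile groups)
set.seed(123)
index_2 <- createDataPartition(ames$Sale_Price, p = 0.7, list = FALSE)
train_2 <- ames[index_2, ]
test_2 <- ames[-index_2, ]
# Using rsample package
set.seed(123)
split_1 <- initial_split(ames, prop = 0.7)
train_3 <- training(split_1)
test_3 <- testing(split_1)
# Using h2o package (needs a running h2o cluster and the data as an H2OFrame)
h2o.init()
ames.h2o <- as.h2o(ames)
split_2 <- h2o.splitFrame(ames.h2o, ratios = 0.7, seed = 123)
train_4 <- split_2[[1]]
test_4 <- split_2[[2]]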
With sufficient sample size, this sampling approach will typically result in a similar distribution of 𝑌 (for example, Sale_Price in the ames data) between your training and test sets.
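One way to check this claim is to compare a few quantiles of the response in the two sets after a simple random split. The sketch below uses a simulated, right-skewed response as a stand-in for a price variable; the data and the chosen quantiles are assumptions for illustration, not the ames data.

```r
# Simulated right-skewed response (assumption; stands in for a price variable)
set.seed(123)
n <- 2000
dat <- data.frame(y = rlnorm(n, meanlog = 12, sdlog = 0.4))

# Simple random 70/30 split
idx <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[idx, , drop = FALSE]
test <- dat[-idx, , drop = FALSE]

# Compare selected quantiles of y in the two sets
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
comparison <- rbind(
  train = quantile(train$y, probs),
  test = quantile(test$y, probs)
)
print(round(comparison))
```

If the rows of the printed table are close, the random split has preserved the shape of the response distribution reasonably well.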
3. Stratified Sampling
If we want to explicitly control the sampling so that our training and test sets have similar 𝑌 distributions, we can use stratified sampling. This is more common with classification problems where the response variable may be severely imbalanced (for example, 90% of observations with response Yes and 10% with response No). We can also apply stratified sampling to regression problems for data sets that have a small sample size and where the response variable deviates strongly from normality (for example, positively skewed like Sale_Price). With a continuous response variable, stratified sampling will segment 𝑌 into quantiles and randomly sample from each. Consequently, this will help ensure a balanced representation of the response distribution in both the training and test sets.
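The quantile idea for a continuous response can be sketched in base R. This is only an illustration under stated assumptions (simulated data, quartile bins, a 70/30 split within each bin), not how any particular package implements it.

```r
# Simulated right-skewed continuous response (assumption for illustration)
set.seed(123)
n <- 400
dat <- data.frame(y = rlnorm(n, meanlog = 12, sdlog = 0.5))

# Segment y into quartile bins
breaks <- quantile(dat$y, probs = seq(0, 1, by = 0.25))
dat$bin <- cut(dat$y, breaks = breaks, include.lowest = TRUE)

# Within each bin, randomly pick 70% of the rows for training
rows_by_bin <- split(seq_len(n), dat$bin)
train_idx <- unlist(lapply(rows_by_bin, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}), use.names = FALSE)

train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Each bin should make up the same share of both sets
print(table(train$bin) / nrow(train))
print(table(test$bin) / nrow(test))
```

Because every bin contributes the same fraction of its rows, the low, middle, and high ranges of 𝑌 are represented equally in both sets.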
The easiest way to perform stratified sampling on a response variable is to use the rsample package, where you specify the response variable to stratify on. The following code shows that in our original employee attrition data we have an imbalanced response (No: 84%, Yes: 16%). By enforcing stratified sampling, both our training and testing sets have approximately equal response distributions.
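library(rsample)   # initial_split(), training(), testing()
library(magrittr)  # %>%
# churn is assumed to hold the employee attrition data with a factor column Attrition
# original response distribution
table(churn$Attrition) %>% prop.table()
# stratified sampling with the rsample package
set.seed(123)
split_strat <- initial_split(churn, prop = 0.7, strata = "Attrition")
train_strat <- training(split_strat)
test_strat <- testing(split_strat)
# response distribution in the training and test sets
table(train_strat$Attrition) %>% prop.table()
table(test_strat$Attrition) %>% prop.table()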
4. Class Imbalances
Imbalanced data can have a significant impact on model predictions and performance (Kuhn and Johnson, 2013). Most often this involves classification problems where one class has a very small proportion of observations (for example, defaults at 5% versus non-defaults at 95%). Several sampling methods have been developed to help remedy class imbalance, and most of them can be categorized as either up-sampling or down-sampling.
Down-sampling balances the dataset by reducing the size of the abundant classes to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new dataset can be retrieved for further modeling. Furthermore, the reduced sample size reduces the computation burden imposed by further steps in the ML process.
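A minimal base R sketch of down-sampling follows. The data are simulated with a 5% rare class, and the column names are hypothetical; in practice this kind of resampling is usually applied to the training set only.

```r
# Simulated imbalanced data: 5% "Yes", 95% "No" (assumption for illustration)
set.seed(123)
dat <- data.frame(
  x = rnorm(1000),
  default = factor(c(rep("Yes", 50), rep("No", 950)))
)

# Size of the least prevalent class
n_rare <- min(table(dat$default))

# Draw n_rare rows without replacement from every class;
# the rare class keeps all of its rows
rows_by_class <- split(seq_len(nrow(dat)), dat$default)
down_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = n_rare)]
}), use.names = FALSE)

down <- dat[down_idx, ]
print(table(down$default))
```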
In contrast, up-sampling is used when the quantity of data is insufficient. It tries to balance the dataset by increasing the size of the rarer classes. Rather than getting rid of abundant samples, new rare samples are generated by using repetition or bootstrapping.
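Up-sampling by repetition can be sketched the same way. The example below again assumes simulated data with hypothetical column names: it keeps every original row and adds bootstrap draws (sampling with replacement) from the rare class until both classes are the same size.

```r
# Simulated imbalanced data: 5% "Yes", 95% "No" (assumption for illustration)
set.seed(123)
dat <- data.frame(
  x = rnorm(1000),
  default = factor(c(rep("Yes", 50), rep("No", 950)))
)

# Size of the most prevalent class
n_major <- max(table(dat$default))

# Keep all original rows; top up smaller classes with bootstrap draws
rows_by_class <- split(seq_len(nrow(dat)), dat$default)
up_idx <- unlist(lapply(rows_by_class, function(rows) {
  n_extra <- n_major - length(rows)
  if (n_extra > 0) {
    c(rows, rows[sample.int(length(rows), size = n_extra, replace = TRUE)])
  } else {
    rows
  }
}), use.names = FALSE)

up <- dat[up_idx, ]
print(table(up$default))
```

The repeated rows add no new information, so this is best done after the train/test split and on the training set only; otherwise copies of the same observation can end up in both sets.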