In machine learning, it is essential to evaluate the performance of a model on unseen data. One way to do this is by dividing the data into two sets: a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate the model’s performance.
One way to split the data into these two sets is through the sklearn train test split function from the sklearn.model_selection module. This function randomly splits the data into a specified ratio of training and testing data.
In this article, we will explain the train_test_split function in detail and walk through several examples of how to use it in Python.
What is the sklearn train test split function?
The sklearn train test split function is a method in the sklearn.model_selection module that allows us to split a dataset into two subsets: a training set and a testing set. The training set is used to train a machine learning model, while the testing set is used to evaluate the model’s performance.
Here is the basic syntax of the train_test_split function:
# syntax of the sklearn train_test_split function
train_test_split(X, y, test_size=0.25, random_state=None, shuffle=True, stratify=None)
Let’s break down the parameters of this function:
X: This is the feature data, which is a 2D array of shape (n_samples, n_features).
y: This is the target data, which is a 1D array of shape (n_samples,).
test_size: This is the proportion of the dataset that should be allocated to the testing set. By default, it is set to 0.25, meaning that 25% of the data will be allocated to the testing set.
random_state: This is the seed for the random number generator used to split the data. If this is set to a specific integer, the data will be split in the same way each time the code is run. If it is set to None, the data will be split differently each time the code is run.
shuffle: This is a boolean value that specifies whether the data should be shuffled before it is split. By default, it is set to True, meaning that the data will be shuffled before it is split.
stratify: This is a 1D array of shape (n_samples,) that specifies the class labels of the samples. If this is set, the function will ensure that the proportion of classes in the training and testing sets is the same as the proportion of classes in the original dataset.
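To make these parameters concrete, here is a minimal runnable sketch that sets every parameter explicitly; the synthetic dataset from make_classification is purely illustrative:

# a minimal sketch showing every train_test_split parameter on a synthetic dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# build a small illustrative dataset: 100 samples, 4 features, 2 classes
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # 25% of the samples go to the testing set
    random_state=0,   # fixed seed so the split is reproducible
    shuffle=True,     # shuffle the rows before splitting
    stratify=y,       # preserve the class proportions in both sets
)

print(X_train.shape, X_test.shape)  # (75, 4) (25, 4)
print(y_train.shape, y_test.shape)  # (75,) (25,)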

How to use the sklearn train test split function
Now that we’ve covered the basic syntax of the train_test_split function, let’s go through an example of how to use it.
First, we’ll need to import the necessary modules:
# importing the sklearn train test split function
from sklearn.model_selection import train_test_split
The train_test_split function is a function in the sklearn.model_selection module that allows you to split a dataset into two sets: a training set and a testing set. You can use this function to evaluate the performance of a machine learning model by training it on the training set and then testing its performance on the testing set.
Here’s an example of how you can use the train_test_split function:
# importing the sklearn train test split function
from sklearn.model_selection import train_test_split

# Assume that you have a dataset with features and labels stored in X and y, respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In this example, the train_test_split function will randomly split the dataset into a training set and a testing set, with the training set taking up 80% of the data and the testing set taking up the remaining 20%. The X_train and y_train variables will contain the features and labels for the training set, while the X_test and y_test variables will contain the features and labels for the testing set.
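To see how the split fits into a complete evaluation workflow, here is a short runnable sketch; the iris dataset and the LogisticRegression model are just illustrative choices:

# illustrative end-to-end workflow: split, train, evaluate
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load an example dataset
X, y = load_iris(return_X_y=True)

# hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# evaluate on the held-out testing set
print("test accuracy:", model.score(X_test, y_test))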
You can also specify the random_state parameter to ensure that the same split is generated each time you run the function. This can be useful if you want to be able to reproduce your results.
# specifying the random state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
You can also specify the stratify parameter, which ensures that the class distribution is preserved in both the training and testing sets. This can be useful if your dataset is imbalanced (i.e., if some classes have significantly more examples than others).
# using the stratify parameter
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
Now let us explain each of the parameters in the sklearn train test split function.
Setting the sklearn test_size parameter
The test_size parameter in the train_test_split function determines the proportion of the dataset that will be allocated to the testing set. It can be specified as a float between 0 and 1, or as an integer representing the number of samples in the testing set.
For example, if you set test_size=0.2, this will allocate 20% of the data to the testing set and 80% to the training set. If you set test_size=100, this will allocate 100 samples to the testing set and the remaining samples to the training set.
By default, the test_size parameter is set to 0.25, which means that the testing set will take up 25% of the data and the training set will take up the remaining 75%.
Here’s an example of how you can specify the test_size parameter:
# importing the sklearn train test split function
from sklearn.model_selection import train_test_split

# Split the data into a training set and a testing set, with the testing set taking up 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Split the data into a training set and a testing set, with the testing set containing 100 samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100)
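As a quick sanity check, here is a brief runnable sketch (using a synthetic dataset purely for illustration) confirming that a float test_size allocates a proportion of the samples while an integer allocates an exact count:

import numpy as np
from sklearn.model_selection import train_test_split

# a synthetic dataset with 500 samples and 3 features, for illustration only
X = np.random.rand(500, 3)
y = np.random.randint(0, 2, size=500)

# float: 20% of 500 samples -> 100 test samples
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2)
print(len(X_test_frac))  # 100

# integer: exactly 100 test samples
_, X_test_count, _, _ = train_test_split(X, y, test_size=100)
print(len(X_test_count))  # 100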
It’s generally a good idea to set aside a separate testing set to evaluate the performance of your machine learning model. By default, the train_test_split function shuffles the data and samples it randomly, so the training and testing sets are usually approximately representative of the overall dataset; for imbalanced data, the stratify parameter (covered below) gives a stronger guarantee.
Setting the random_state parameter
The random_state parameter in the train_test_split function allows you to specify a seed for the random number generator that is used to shuffle the data before splitting it into the training and testing sets. Setting the random_state parameter to a fixed value will ensure that the same split is generated each time you run the train_test_split function, as long as the input data is the same.
Here’s an example of how you can specify the random_state parameter:
# importing the sklearn train test split function
from sklearn.model_selection import train_test_split

# Split the data into a training set and a testing set, with the testing set taking up 20% of the data
# and the random number generator seeded with 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
By default, the random_state parameter is not set, so the random number generator will be initialized with a different seed each time you run the train_test_split function. This means that the resulting training and testing sets will be different each time you run the function, even if the input data is the same.
It’s generally a good idea to set the random_state parameter if you want to be able to reproduce your results. This can be especially important when you are comparing the performance of different machine learning models, as it allows you to ensure that the models are being trained and tested on the same data.
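Here is a small runnable sketch (the synthetic arrays are purely illustrative) demonstrating that reproducibility: two calls with the same random_state yield identical splits, while calls without a seed usually do not:

import numpy as np
from sklearn.model_selection import train_test_split

# a small illustrative dataset: 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# two splits with the same seed are identical
X_train_a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_train_a, X_train_b))  # True

# without a seed, repeated runs will almost always differ
X_train_c, _, _, _ = train_test_split(X, y, test_size=0.2)
X_train_d, _, _, _ = train_test_split(X, y, test_size=0.2)
print(np.array_equal(X_train_c, X_train_d))  # very likely False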
Using stratified sampling
Stratified sampling is a sampling method that ensures that the class distribution of the samples in the training and testing sets is the same as the class distribution in the overall dataset. This can be useful if your dataset is imbalanced (i.e., if some classes have significantly more examples than others).
To use stratified sampling with the train_test_split function in sklearn, you can set the stratify parameter to the labels of your dataset. The train_test_split function will then ensure that the class distribution in the training and testing sets is the same as the class distribution in the overall dataset.
Here’s an example of how you can use stratified sampling with the train_test_split function:
# importing the sklearn train test split function
from sklearn.model_selection import train_test_split

# Split the data into a training set and a testing set, with the testing set taking up 20% of the data
# and the class distribution preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
In this example, the train_test_split function will split the data into a training set and a testing set, with the testing set taking up 20% of the data. It will also ensure that the class distribution in both the training and testing sets is the same as the class distribution in the overall dataset.
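You can check the preserved class distribution yourself. The sketch below uses a deliberately imbalanced synthetic dataset (90% class 0, 10% class 1, purely for illustration) and compares the class counts in both sets:

import numpy as np
from sklearn.model_selection import train_test_split

# an imbalanced illustrative dataset: 90 samples of class 0, 10 of class 1
X = np.random.rand(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# both sets keep the 9:1 class ratio
print(np.unique(y_train, return_counts=True))  # (array([0, 1]), array([72, 8]))
print(np.unique(y_test, return_counts=True))   # (array([0, 1]), array([18, 2]))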
Other split functions in sklearn
The sklearn library provides several utilities for splitting datasets into training and testing sets, in addition to the train_test_split function. These are implemented as splitter classes in the sklearn.model_selection module. Here are a few examples:
StratifiedKFold: This splitter divides the dataset into a specified number of folds, ensuring that the class distribution is preserved in each fold. This can be useful for cross-validation, where you train and evaluate your model on different folds of the data (a brief sketch follows this list).
StratifiedShuffleSplit: This splitter generates a specified number of train/test splits, shuffling the data before each split and ensuring that the class distribution is preserved in both the training and testing sets.
GroupShuffleSplit: This splitter generates shuffled train/test splits according to a grouping variable, ensuring that all samples from the same group always end up together, either in the training set or in the testing set.
TimeSeriesSplit: This splitter is specifically designed for time series data. It produces train/test splits in which each split is contiguous in time and the training set always precedes the testing set.
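As an illustration, here is a short runnable sketch of two of these splitters on a small synthetic dataset; note that, unlike train_test_split, they yield index arrays rather than the data subsets themselves:

import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.random.rand(12, 2)   # a tiny illustrative dataset
y = np.array([0, 1] * 6)    # balanced labels for stratification

# StratifiedKFold: each fold keeps the 1:1 class ratio
skf = StratifiedKFold(n_splits=3)
for train_idx, test_idx in skf.split(X, y):
    print("fold test labels:", y[test_idx])

# TimeSeriesSplit: the training indices always precede the testing indices
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X):
    print("train:", train_idx, "test:", test_idx)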
Summary
The train_test_split function in the sklearn library is a useful tool for splitting a dataset into a training set and a testing set. It allows you to evaluate the performance of a machine learning model by training it on the training set and then testing its performance on the testing set.
There are several parameters that you can use to customize the split, including:
test_size: The proportion of the dataset that should be allocated to the testing set. This can be specified as a float between 0 and 1, or as an integer representing the number of samples in the testing set.
random_state: A seed for the random number generator that is used to shuffle the data before splitting it into the training and testing sets. Setting this parameter to a fixed value will ensure that the same split is generated each time you run the train_test_split function, as long as the input data is the same.
stratify: The labels of the dataset, which are used to ensure that the class distribution is preserved in both the training and testing sets. This can be useful if your dataset is imbalanced (i.e., if some classes have significantly more examples than others).
In addition to the train_test_split function, sklearn also provides several other splitting utilities that are designed for specific types of data or scenarios. These include StratifiedKFold, StratifiedShuffleSplit, GroupShuffleSplit, and TimeSeriesSplit. In this article, we covered each of these points one by one with examples.