Split Your Dataset With scikit-learn's train_test_split()
One of the key aspects of supervised machine learning is model evaluation and validation. When you evaluate the predictive performance of your model, it’s essential that the process be unbiased. Using train_test_split() from the data science library scikit-learn, you can split your dataset into subsets that minimize the potential for bias in your evaluation and validation process.
In this tutorial, you’ll learn:
- Why you need to split your dataset in supervised machine learning
- Which subsets of the dataset you need for an unbiased evaluation of your model
- How to use train_test_split() to split your data
- How to combine train_test_split() with prediction methods

In addition, you’ll get information on related tools from sklearn.model_selection.
The Importance of Data Splitting
Supervised machine learning is about creating models that precisely map the given inputs to the given outputs. Inputs are also called independent variables or predictors, while outputs may be referred to as dependent variables or responses.
How you measure the precision of your model depends on the type of problem you’re trying to solve. In regression analysis, you typically use the coefficient of determination, root mean square error, mean absolute error, or similar quantities. For classification problems, you often apply accuracy, precision, recall, F1 score, and related indicators.
The acceptable numeric values that measure precision vary from field to field. You can find detailed explanations from Statistics By Jim, Quora, and many other resources.
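To make these measures concrete, here’s a minimal sketch of computing a few of them with sklearn.metrics; the true and predicted values below are invented for illustration:

```python
from sklearn.metrics import (
    r2_score,             # coefficient of determination (regression)
    mean_absolute_error,  # mean absolute error (regression)
    accuracy_score,       # accuracy (classification)
    f1_score,             # F1 score (classification)
)

# Regression: compare true and predicted continuous values
print(r2_score([3.0, 2.5, 4.0, 5.1], [2.8, 2.9, 4.2, 4.9]))
print(mean_absolute_error([3.0, 2.5, 4.0, 5.1], [2.8, 2.9, 4.2, 4.9]))

# Classification: compare true and predicted class labels
print(accuracy_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
print(f1_score([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))
```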
What’s most important to understand is that you usually need unbiased evaluation to properly use these measures, assess the predictive performance of your model, and validate the model.
This means that you can’t evaluate the predictive performance of a model with the same data you used for training. You need to evaluate the model with fresh data that the model hasn’t seen before. You can accomplish that by splitting your dataset before you use it.
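For example, here’s a minimal sketch of such a split on a small made-up dataset, reserving 20% of the samples for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A tiny made-up dataset: 10 samples with 2 features each
X = np.arange(20).reshape(-1, 2)
y = np.arange(10)

# Hold out 20% of the samples for testing; random_state makes
# the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```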
Training, Validation, and Test Sets
Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, it’s enough to split your dataset randomly into three subsets:
- The training set is applied to train, or fit, your model. For example, you use the training set to find the optimal weights, or coefficients, for linear regression, logistic regression, or neural networks.
- The validation set is used for unbiased model evaluation during hyperparameter tuning. For example, when you want to find the optimal number of neurons in a neural network or the best kernel for a support vector machine, you experiment with different values. For each considered setting of hyperparameters, you fit the model with the training set and assess its performance with the validation set.
- The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.
In less complex cases, when you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets.
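When you do need all three subsets, a common pattern is to call train_test_split() twice: once to carve off the test set, and again to split the remainder into training and validation sets. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(-1, 2)  # 20 made-up samples
y = np.arange(20)

# First split: hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remainder becomes the validation set,
# giving a 60/20/20 train/validation/test split overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(y_train), len(y_val), len(y_test))  # 12 4 4
```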
Underfitting and Overfitting
Splitting a dataset might also be important for detecting if your model suffers from one of two very common problems, called underfitting and overfitting:
- Underfitting is usually the consequence of a model being unable to encapsulate the relations among data. For example, this can happen when trying to represent nonlinear relations with a linear model. Underfitted models will likely have poor performance with both training and test sets.
- Overfitting usually takes place when a model has an excessively complex structure and learns both the existing relations among data and noise. Such models often have bad generalization capabilities. Although they work well with training data, they usually yield poor performance with unseen test data, as the sketch after this list illustrates.
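Here’s a minimal sketch of how you might spot overfitting by comparing training and test scores; the noisy sine data and the unconstrained decision tree are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a sine wave with random noise
rng = np.random.default_rng(42)
X = np.linspace(0, 5, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unconstrained tree can memorize the training noise
model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# A large gap between the two R² scores suggests overfitting
print("Training R²:", model.score(X_train, y_train))  # close to 1.0
print("Test R²:", model.score(X_test, y_test))        # noticeably lower
```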
You can find a more detailed explanation of underfitting and overfitting in Linear Regression in Python.
Prerequisites for Using train_test_split()
Now that you understand the need to split a dataset in order to perform unbiased model evaluation and identify underfitting or overfitting, you’re ready to learn how to split your own datasets.
You’ll use version 1.5.0 of scikit-learn, or sklearn. It has many packages for data science and machine learning, but for this tutorial, you’ll focus on the model_selection package, specifically on the function train_test_split().
Note: While this tutorial is tested with this specific version of scikit-learn, the features that you’ll use are core to the library and should work equivalently in other versions of scikit-learn as well.
You can install sklearn with pip:
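A typical command, assuming a standard pip setup and pinning the version used in this tutorial, looks like this:

```console
$ python -m pip install scikit-learn==1.5.0
```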
Read the full article at https://realpython.com/train-test-split-python-data/ »