Scaling Kaggle Competitions Using XGBoost: Part 2 : Hector Martinez

Scaling Kaggle Competitions Using XGBoost: Part 2
by: Hector Martinez
blow post content copied from  PyImageSearch
click here to view original post

Table of Contents

Scaling Kaggle Competitions Using XGBoost: Part 2

In our previous tutorial, we went through the basic foundation behind XGBoost and learned how easy it was to incorporate a basic XGBoost model into our project. We went through the core essentials required to understand XGBoost, namely decision trees and ensemble learners.

Although we learned to plug XGBoost into our projects, we have yet to scratch the surface of the magic behind it. In this tutorial, we aim to reduce our idea of XGBoost as a black body and learn a bit more about the simple mathematics that makes it so good.

But the process is a step-by-step one. To avoid crowding a single tutorial with too much math, we check out the math behind today’s concept: AdaBoost. Once that is cleared, we will address Gradient Boosting before finally addressing XGBoost in the next tutorial.

In this tutorial, you will learn the underlying math behind one of the prerequisites of XGBoost.

We will also address a slightly more challenging Kaggle dataset and attempt to use XGBoost to get better results.

This lesson is the 2nd of a 4-part series on Deep Learning 108:

  1. Scaling Kaggle Competitions Using XGBoost: Part 1
  2. Scaling Kaggle Competitions Using XGBoost: Part 2 (this tutorial)
  3. Scaling Kaggle Competitions Using XGBoost: Part 3
  4. Scaling Kaggle Competitions Using XGBoost: Part 4

To learn how to figure out the math behind AdaBoost, just keep reading.

Looking for the source code to this post?

Jump Right To The Downloads Section

Scaling Kaggle Competitions Using XGBoost: Part 2

In the previous blog post of this series, we briefly covered concepts like decision trees and gradient boosting, before touching up on the concept of XGBoost.

Subsequently, we saw how easy it was to use in code. Now, let’s get our hands dirty and get closer to understanding the math behind it!


A formal definition of AdaBoost (Adaptive Boosting) is “the combination of the output of weak learners into a weighted sum, representing the final output.” But that leaves a lot of things vague. Let’s start dissecting our way to understanding what this statement means.

Since we have been dealing with trees, we will assume that our adaptive boosting technique is being applied to decision trees.

The Dataset

To begin our journey, we will consider the dummy dataset shown in Table 1.

Table 1: The Dataset.

The features of this dataset are mainly “courses” and “Project,” while the labels in the “Job” column tell us whether the person has a job right now or not. If we take the decision tree approach, we would need to figure out a proper criterion to split the dataset into its corresponding labels.

Now, a key thing to remember about adaptive boosting is that we are sequentially creating trees and carrying over the errors generated by each one to the next. While in Random Forest, we would have created full-sized trees, but in adaptive boosting, the trees created are stumps (trees with depth 1).

Now a general line of thought can be that the error generated by one single tree based on some criterion decides the nature of the second tree. Having a complex tree overfit on the data will defeat the purpose of combining many weak learners into strong ones. However, AdaBoost has shown results in more complex trees too.

Sample Weights

Our next step is assigning a sample weight to each sample in our dataset. Initially, all the samples will have the same weight, and their weights HAVE TO add up to 1 (Table 2). As we move toward our end goal, the concept of sample weights will be much clearer.

Table 2: With Added Sample Weights.

For now, since we have 7 data samples, we will assign 1/7 to each sample. This signifies that all samples are currently equally important in terms of their significance in the final structure of the decision tree.

Choosing the Right Feature

Now we have to initialize our first stump with some information.

Just by a glance, we can see that for this dummy dataset, the “courses” feature splits the dataset better than the “project” feature. Let’s see this for ourselves in Figure 1.

Figure 1: The First Stump.

The split factor here is the number of courses taken by a student. We notice that among the students who have taken more than 7 courses, all of them have a job. Conversely, for students with less than or equal to 7 courses, the majority don’t have a job.

However, the split gives us 6 correct predictions and 1 incorrect one. This means one sample in our dataset gives us the wrong prediction, if we use this particular stump.

Significance of a Stump

Now, our particular stump has made one wrong prediction. To calculate the error, we have to add the weights from all incorrectly classified samples, which in this case would be 1/7, since only a single sample is getting classified incorrectly.

Now, a key difference between AdaBoost and Random Forest is that in the former, one stump might have more weightage in the final output than other stumps. So the question here becomes how to calculate

\left(\displaystyle\frac{1}{2} \right) \ln\left(\displaystyle\frac{1 - \text{loss}}{\text{loss}}\right).

So, if we insert our loss value (1/7) in this formula, we get 0.89. Notice that the formula should give us values between 0 and 1. So 0.89 is a very high value, telling us that this particular stump has a big say in the final output of the combination of stumps.

Calculating the New Sample Weights

If we have built a classifier that gives us the correct prediction over all samples but one, our natural course of action should be to make sure we focus more on that particular sample to have it grouped with the right class.

Currently, all the samples have the same weight (1/7). But we want our classifier to focus more on the wrongly classified sample. For that, we have to make changes to the weights.

Remember, as we change the weight of the sample, the weights of all other samples will also shift, as long as all of the weights will add to 1.

The equation for calculating the new weights for the wrongly classified samples is

\text{old weight} \times e^\text{(significance of the model)}.

Plugging in the values from our dummy dataset becomes

(1/7) \times e^{(0.89)},

giving us a value of 0.347.

Previously, the sample weight for this particular sample was 1/7, which becomes ≈0.142. The new weight is 0.347, meaning that the sample significance is increased. Now, we need to shift the rest of the sample weights accordingly.

The formula for that is

(1/7) \times e^{(-0.89)},

which comes to 0.0586. So now, instead of 1/7, the 6 correctly predicted samples will have weight 0.0586, while the wrongly predicted sample will have weight 0.347.

The final step here is to normalize the weights, so they add up to 1, which makes the wrongly predicted sample to have weight 0.494 and the other samples to have weight 0.084.

Moving Forward: The Subsequent Stumps

Now there are several ways to determine how the next stump will be. Our priority currently is to make sure the second stump classifies the sample with the larger weight (which we found to be 0.494 in the last iteration).

The focus of this sample is because this sample caused the error in our previous stump.

While there is no standard way of selecting the new stump, as most work fine, a popular method is to create a new dataset altogether from our current dataset and its sample weights.

Let’s recall how far we are currently (Table 3).

Table 3: New Weights for the Samples.

Our new dataset will come from random sampling based on the current weights. We will be creating buckets of ranges. The ranges will be as follows:

  • We choose the first sample for values between 0 and 0.084.
  • We choose the second sample for values between 0.084 and 0.168.
  • We choose the third sample for values 0.168 and 0.252.
  • We choose the fourth sample for values 0.252 and 0.746.

By now, I hope you have figured out the pattern we are following here. The buckets are formed based on the weights: 0.084 (first sample), 0.084 + 0.084 (second sample), 0.084 + 0.084 + 0.084 (third sample), 0.084 + 0.084 + 0.084 + 0.494 (fourth sample), and so on.

Naturally, since the fourth sample has a larger range, it will appear more times in our next dataset if random sampling is applied. So our new dataset will have 7 entries, but it might have 4 entries belonging to the name “Tom.” We reset the weights for the new dataset, but intuitively, since “Tom” appears more often, it will significantly affect how the tree turns out.

Hence, we repeat the process to find the best feature to create our stump. This time, due to the dataset being different, we will create the stump with a different feature rather than the number of courses.

Note how we have been saying that the first stump’s output will determine how the second tree works. Here we saw how the weights received from the first stump determined how the second tree would work. This process is rinsed and repeated for N number of stumps.

Piecing It Together

We have learned about stumps and the significance of their outputs. Let’s say we feed one test sample to our group of stumps. The first stump gives a 1, the second one gives a 1, the third one gives a 0, while the fourth one also gives a 0.

But since each tree has a significance value attached to it, we will get a final weighted output, determining the nature of the test sample fed to our stumps.

With the completion of AdaBoost, we are one more step closer to understanding the XGBoost algorithm. Now, we will tackle a slightly intermediate Kaggle task and solve it using XGBoost.

Configuring Your Development Environment

To follow this guide, you need to have the OpenCV library installed on your system.

Luckily, OpenCV is pip-installable:

$ pip install opencv-contrib-python

If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.

Having Problems Configuring Your Development Environment?

Having trouble configuring your dev environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join PyImageSearch University — you’ll be up and running with this tutorial in a matter of minutes.

All that said, are you:

  • Short on time?
  • Learning on your employer’s administratively locked system?
  • Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
  • Ready to run the code right now on your Windows, macOS, or Linux system?

Then join PyImageSearch University today!

Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.

And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!

Setting Up the Prerequisites

Today, we will be tackling the USA Real Estate Dataset. Let’s start by importing the necessary packages for our project.

# import the necessary packages
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In our imports, we have pandas, xgboost, and some utility functions from scikit-learn like mean_squared_error and train_test_split on Lines 2-5.

# load the data in the form of a csv
estData =   pd.read_csv("/content/realtor-data.csv")
# drop NaN values from the dataset
estData = estData.dropna()
# split the labels and remove non-numeric data
y = estData["price"].values
X = estData.drop(["price"], axis=1).select_dtypes(exclude=['object'])

On Line 8, we use read_csv to load the csv file as a Pandas Dataframe. A bit of exploratory data analysis (EDA) on the dataset would show many NaN (Not-a-Number or Undefined) values. These can hinder our training, so we drop all tuples with NaN values on Line 11.

We proceed to prepare the labels and features on Lines 14 and 15. We exclude any column which doesn’t contain numerical values in our feature set. It is possible to include that data, but we need to convert it into one-hot encoding.

Our final dataset is going to look something like Table 4.

Table 4: The Features.

We will use these features to determine a house’s price (label).

Building the Model

Our next step is to initialize and train the XGBoost model.

# create the train test split
xTrain, xTest, yTrain, yTest = train_test_split(X, y)
# define the XGBoost regressor according to your specifications
xgbModel = xgb.XGBRegressor(
# fit the data into the model, yTrain,
             verbose = False)
# calculate the importance of each feature used
impFeat = pd.DataFrame(xgbModel.feature_importances_.reshape(1, -1), columns=X.columns)

On Line 18, we use the train_test_split function to split our dataset into training and test sets. For the XGBoost model, we initialize it using 1000 trees and a max tree depth of 4 (Lines 21-26).

We fit the training data into our model on Lines 29 and 30. Once our training is complete, we can assess which features our model gives more importance to (Table 5).

Table 5: Significance Distribution over Features.

Table 5 shows that the feature “bath” has the most significance.

Assessing the Model

Our next step is to see how well our model would perform on unseen test data.

# get predictions on test data
yPred = xgbModel.predict(xTest)
# store the msq error from the predictions
msqErr = mean_squared_error(yPred, yTest)
# assess your model’s results
xgbModel.score(xTest, yTest)

On Lines 36-39, we get our model’s predictions on the testing dataset and calculate the mean-squared error (MSE). Since this is a regression model, be reassured by a high MSE value.

Our final step is to use the model.score function to get an accuracy value for the testing set (Line 42).

The accuracy shows that our model has nearly 95% accuracy on the testing set.

What's next? I recommend PyImageSearch University.

Course information:
60+ total classes • 64+ hours of on-demand code walkthrough videos • Last updated: Dec 2022
★★★★★ 4.84 (128 Ratings) • 15,800+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That’s not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you'll find:

  • 60+ courses on essential computer vision, deep learning, and OpenCV topics
  • 60+ Certificates of Completion
  • 64+ hours of on-demand video
  • Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
  • Pre-configured Jupyter Notebooks in Google Colab
  • ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
  • ✓ Access to centralized code repos for all 500+ tutorials on PyImageSearch
  • Easy one-click downloads for code, datasets, pre-trained models, etc.
  • Access on mobile, laptop, desktop, etc.

Click here to join PyImageSearch University


In this tutorial, we first understood the math behind one of the prerequisites to XGBoost, Adaptive boosting. Then, like our first blog post in this series, we tackled a dataset from Kaggle and used XGBoost to attain a pretty high accuracy over the testing dataset, yet again establishing XGBoost’s dominance as one of the leading classical machine learning techniques.

Our next stop would be to figure out the math behind Gradient Boosting before we finally dissect XGBoost to its foundations.

Citation Information

Martinez, H. “Scaling Kaggle Competitions Using XGBoost: Part 2,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022,

  author = {Hector Martinez},
  title = {Scaling {Kaggle} Competitions Using {XGBoost}: Part 2},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
  year = {2022},
  note = {},

Want free GPU credits to train models?

  • We used, a GPU cloud, for all the experiments.
  • We are proud to offer PyImageSearch University students $20 worth of GPU cloud credits. Join PyImageSearch University and claim your $20 credit here.

In Deep Learning, we need to train Neural Networks. These Neural Networks can be trained on a CPU but take a lot of time. Moreover, sometimes these networks do not even fit (run) on a CPU.

To overcome this problem, we use GPUs. The problem is these GPUs are expensive and become outdated quickly.

GPUs are great because they take your Neural Network and train it quickly. The problem is that GPUs are expensive, so you don’t want to buy one and use it only occasionally. Cloud GPUs let you use a GPU and only pay for the time you are running the GPU. It’s a brilliant idea that saves you money.

JarvisLabs provides the best-in-class GPUs, and PyImageSearch University students get between 10-50 hours on a world-class GPU (time depends on the specific GPU you select).

This gives you a chance to test-drive a monstrously powerful GPU on any of our tutorials in a jiffy. So join PyImageSearch University today and try it for yourself.

Click here to get Jarvislabs credits now

To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!

Download the Source Code and FREE 17-page Resource Guide

Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!

The post Scaling Kaggle Competitions Using XGBoost: Part 2 appeared first on PyImageSearch.

December 12, 2022 at 07:30PM
Click here for more details...

The original post is available in PyImageSearch by Hector Martinez
this post has been published as it is through automation. Automation script brings all the top bloggers post under a single umbrella.
The purpose of this blog, Follow the top Salesforce bloggers and collect all blogs in a single place through automation.