3 Ways to Speed Up and Improve Your XGBoost Models

Introduction

Extreme gradient boosting (XGBoost) is one of the most prominent machine learning techniques, used not only for experimentation and analysis but also in deployed predictive solutions in industry. An XGBoost ensemble combines multiple models to address a predictive task like classification, regression, or forecasting. It trains a set of decision trees sequentially, gradually improving the quality of predictions by correcting the errors made by the previous trees in the sequence.

In a recent article, we explored why it matters to interpret predictions made by XGBoost models and how to do so (note we use the term ‘model’ here for simplicity, even though XGBoost is an ensemble of models). This article takes another practical dive into XGBoost, this time illustrating three strategies to speed up and improve its performance.

Initial Setup

To illustrate the three strategies to improve and speed up XGBoost models, we will use an employee dataset containing demographic and financial attributes. It is publicly available in this repository.

The following code loads the dataset, removes instances containing missing values, identifies 'income' as the target attribute we want to predict, and separates it from the features.
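Here is a minimal sketch of this step, assuming the dataset has been downloaded locally as employee_dataset.csv (a placeholder file name) and that the target column is named income:

import pandas as pd

# Placeholder path: adjust to wherever the dataset from the repository is stored
df = pd.read_csv("employee_dataset.csv")

# Remove instances containing missing values
df = df.dropna()

# Separate the 'income' target from the feature columns
X = df.drop(columns=["income"])
y = df["income"]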

1. Early Stopping with Clean Data

While early stopping is popular with complex neural network models, many practitioners don’t consider applying it to ensemble approaches like XGBoost, even though it can strike a great balance between efficiency and accuracy. Early stopping consists of interrupting the iterative training process once the model’s performance on a validation set stabilizes and few further improvements are made. This way, not only do we save training costs for large ensembles trained on vast datasets, but we also reduce the risk of overfitting the model.

This example first imports the necessary libraries and preprocesses the data to be better suited for XGBoost, namely by encoding categorical features (if any) and downcasting numerical ones for further efficiency. It then partitions the dataset into training and validation sets.
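A rough sketch of this preparation, assuming the X and y objects from the setup step above, could look as follows:

import pandas as pd
from sklearn.model_selection import train_test_split

# Encode categorical (object) columns, if any, as integer codes
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes

# Downcast numerical columns to smaller types for extra efficiency
for col in X.select_dtypes(include="number").columns:
    X[col] = pd.to_numeric(X[col], downcast="float")

# Hold out a validation set to monitor performance during training
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)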

Next, the XGBoost model is trained and tested. The key trick here is to use the early_stopping_rounds optional argument when initializing our model. The value set for this argument indicates the number of consecutive training rounds without significant improvements after which the process should stop.
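A sketch of this training step, assuming a numeric income target (hence a regressor), a recent XGBoost version (1.6+, where early_stopping_rounds can be passed to the constructor), and a patience of 50 rounds, might look like this:

from xgboost import XGBRegressor

# Stop training if the validation RMSE does not improve for 50 consecutive rounds
model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="rmse",
    early_stopping_rounds=50,
    random_state=42,
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

print("Stopped at iteration:", model.best_iteration)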

2. Native Categorical Handling

The second strategy is suitable for datasets containing categorical attributes. Since our employee dataset doesn’t contain any, we will first simulate one by creating a categorical attribute, education_level, from binning the existing attribute describing years of education:
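The exact column name depends on the dataset version; assuming the years-of-education attribute is called education_years (a placeholder), the binning could be sketched like this:

import pandas as pd

# Bin years of education into a coarse categorical attribute
# (bin edges and labels are illustrative)
X["education_level"] = pd.cut(
    X["education_years"],
    bins=[0, 9, 12, 16, 30],
    labels=["basic", "secondary", "bachelor", "postgraduate"],
).astype("category")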

The key to this strategy is to process categorical features more efficiently during training. Once more, there’s a critical, lesser-known argument in the XGBoost model constructor that allows this: enable_categorical=True. This way, we avoid traditional one-hot encoding, which, when there are multiple categorical features with several categories each, can easily blow up dimensionality. A big win for efficiency here! Additionally, native categorical handling transparently learns useful category splits such as “one category vs. the rest”, rather than necessarily treating each category in isolation.

Incorporating this strategy in our code is extremely simple:
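As a sketch, assuming the features now include 'category'-typed columns like the education_level attribute created above:

from xgboost import XGBRegressor

# enable_categorical=True lets XGBoost consume pandas 'category' columns
# natively (no one-hot encoding); it is used together with tree_method="hist"
model = XGBRegressor(
    n_estimators=500,
    tree_method="hist",
    enable_categorical=True,
    random_state=42,
)
model.fit(X, y)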

3. Hyperparameter Tuning with GPU Acceleration

The third strategy may sound obvious in terms of seeking efficiency, as it is hardware-related, but its remarkable value for otherwise time-consuming processes like hyperparameter tuning is worth highlighting. You can use device="cuda" and set the runtime type to GPU (if you are working in a notebook environment like Google Colab, this takes just one click) to speed up an XGBoost fine-tuning workflow like this:
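The sketch below assumes a recent XGBoost version (2.0+, where device="cuda" replaced the older gpu_hist setting) and an illustrative random search over a few common hyperparameters:

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Illustrative search space; adjust to your own tuning needs
param_distributions = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.9, 1.0],
}

# device="cuda" moves training to the GPU
gpu_model = XGBRegressor(tree_method="hist", device="cuda", random_state=42)

search = RandomizedSearchCV(
    gpu_model,
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)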

Wrapping Up

This article showcased three hands-on examples of improving XGBoost models, with a particular focus on efficiency in different parts of the modeling process. Specifically, we learned how to implement early stopping to halt training once the validation error stabilizes, how to natively handle categorical features without (sometimes burdensome) one-hot encoding, and lastly, how to accelerate otherwise costly processes like model fine-tuning by using a GPU.

