The concept of the “wisdom of crowds” isn’t just for human psychology—it also applies to machine learning. “Ensemble learning” is the idea that a large number of weak learners can together be more intelligent than a single strong learner.
A "weak" learner is one whose performance in isolation is only somewhat better than random chance. When you combine the results from many weak learners, however, you can essentially "crowdsource" a more accurate answer.
Gradient boosted trees have become one of the most popular machine learning algorithms for ensemble learning. In this article, we’ll discuss what gradient boosted trees are and how you might encounter them in real-world use cases.
Machine Learning Vocabulary
Overfitting occurs in machine learning when you create a learner with rules that are too specific to its training data, so that it performs well on that data but poorly on new data. Not having enough data to train with is one common cause of overfitting.
For example, suppose that you want to build a learner that can detect cars with manufacturing defects. In your training dataset, all of the vehicles with defects are yellow—so it would be very easy for the learner to conclude that only yellow vehicles can have defects, even though the color is completely irrelevant.
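The yellow-car example can be sketched in a few lines. This is only an illustration on made-up data: the `learn_color_rule` and `accuracy` helpers and both toy datasets are hypothetical, not part of any library. The "learner" simply memorizes which colors appeared on defective cars, which looks perfect on the training data and falls apart on new cars.

```python
def learn_color_rule(cars):
    """Memorize which colors appeared on defective cars in the training data."""
    return {car["color"] for car in cars if car["defective"]}

def accuracy(defect_colors, cars):
    """Fraction of cars for which 'color is in defect_colors' matches the true label."""
    correct = sum((car["color"] in defect_colors) == car["defective"] for car in cars)
    return correct / len(cars)

# Hypothetical training data: every defective car happens to be yellow.
train = [
    {"color": "yellow", "defective": True},
    {"color": "yellow", "defective": True},
    {"color": "red", "defective": False},
    {"color": "blue", "defective": False},
]

# Hypothetical new data: color no longer lines up with defects.
new_cars = [
    {"color": "yellow", "defective": False},
    {"color": "red", "defective": True},
    {"color": "blue", "defective": False},
    {"color": "yellow", "defective": True},
]

rule = learn_color_rule(train)
print(accuracy(rule, train), accuracy(rule, new_cars))  # prints: 1.0 0.5
```

The gap between the two numbers is the signature of overfitting: a rule that is flawless on the data it was built from and no better than a coin flip on anything else.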
A decision tree is a type of learner that uses a branching tree structure to split up the dataset.
For a given data point in the dataset, at each level of the tree, you ask the question that best splits the dataset by providing the most information (whether that's the type of engine of the car or the manufacturer). You then follow the branch of the tree that corresponds to the given answer. Each of the tree’s terminal “leaf” nodes contains the model’s prediction for the data points that reach that node.
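One common way to decide which question "provides the most information" is information gain, the reduction in entropy after a split. The sketch below assumes categorical features and uses a made-up four-car dataset; the function names are ours, and real decision tree libraries use more elaborate criteria and handle numeric features as well.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Pick the feature whose split yields the highest information gain."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

# Hypothetical toy data: defects depend on the engine type, not on the color.
cars = [
    {"engine": "diesel", "color": "yellow"},
    {"engine": "diesel", "color": "red"},
    {"engine": "petrol", "color": "yellow"},
    {"engine": "petrol", "color": "red"},
]
defective = [True, True, False, False]

print(best_feature(cars, defective))  # prints: engine
```

Here the engine type separates the defective cars perfectly (a gain of 1 bit), while the color tells you nothing (a gain of 0), so the tree would ask about the engine first.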
Bagging (Bootstrap Aggregating)
The first ensemble learning technique to discuss here is bagging. In bagging, you take samples of your data multiple times (with replacement) and train a new learner on each sample. The final output is the average or most frequent answer of these individual learners.
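The bagging procedure described above can be sketched as follows. This is a minimal illustration, not a production implementation: the weak learner is a hypothetical one-threshold "stump" on a single numeric feature, and the dataset, seed, and number of learners are arbitrary choices made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Weak learner: choose the threshold that misclassifies the fewest points.

    The stump predicts 1 when x > threshold and 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_predict(thresholds, x):
    """Final output: the most frequent answer among the individual stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)                       # fixed seed for repeatability
data = [(x, int(x > 5)) for x in range(11)]  # toy labels: 1 when x > 5

# Train 25 stumps, each on its own bootstrap sample of the data.
thresholds = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagged_predict(thresholds, 9), bagged_predict(thresholds, 1))
```

Because every stump sees a slightly different sample, the learned thresholds vary from one stump to the next, and the majority vote smooths out those differences. For a regression problem you would average the predictions instead of voting.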
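Bagging helps reduce overfitting because each learner develops different rules from a different sample of the data. An individual learner may still overfit its own sample, but the learners tend to make different mistakes, so averaging or voting across them cancels out much of that error.
Random forests are an ensemble learning model built from decision trees, and they are a specific type of bagging. In addition to training each tree on its own sample of the data, a random forest restricts the features that each decision tree can use: typically, only a random subset of the features is considered at each split.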
Many machine learning experts prefer random forests as their baseline model when looking for good performance on new data.
Boosting
Another closely related ensemble learning concept is boosting. In boosting, you first train a single weak learner on the entire dataset. You then examine the mistakes that this first model made and train another learner to rectify these mistakes.
This iterative process of correcting for the previous model’s errors continues until you reach a satisfactory level of accuracy. Like bagging, boosting combines multiple weak learners, but it trains them one after another rather than independently: each new learner concentrates on the mistakes that the previous learners made.
What are Gradient Boosted Trees?
Gradient boosted trees are an ensemble learning model that specifically uses decision trees and boosting to improve the model's results on a dataset. The individual decision trees are typically weak learners, each performing only somewhat better than chance on its own.
A single decision tree whose results are "too good" may be overfitting the data. As a result, the ensemble model may exhibit worse performance on new data. To avoid this in practice, machine learning experts keep each decision tree simple, for example by limiting its depth so that it cannot fit overly specific combinations of features.
With gradient boosting, training the model is treated explicitly as a numerical optimization problem: minimizing a loss function with a procedure modeled on gradient descent.
Imagine that you’re somewhere in a mountainous landscape and you want to reach the lowest point as soon as possible. With so many bumps, hills, and depressions, it’s hard to know exactly which path will be the quickest. However, one reasonable way to start would be to look around you and move downward in the steepest direction.
This is the underlying concept of gradient descent: iteratively moving downward, stopping, and then reassessing your position in order to get a result that’s as small as possible.
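In code, the "look around and step downhill" idea is only a few lines. The sketch below is a generic illustration on a one-dimensional function chosen for the example, f(x) = (x - 3)^2; the `gradient_descent` helper, the learning rate, and the step count are all assumptions for demonstration purposes.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. in the steepest downhill direction."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(round(minimum, 4))  # prints: 3.0
```

The learning rate controls how far you move before reassessing. Steps that are too large can overshoot the valley, while steps that are too small make progress slow.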
To begin with, the first tree is trained to fit the dataset. Then, a loss function is used to measure the errors that this weak model has made. Each subsequent tree is trained to predict the negative gradient of the loss with respect to the current predictions (for squared-error loss, this is simply the remaining errors, or residuals), and its output, usually scaled down by a small learning rate, is added to the ensemble. Essentially, the ensemble moves “downward” in the direction where the loss decreases the fastest, correcting for the previous model's mistakes.
In practice, there are several implementations of the concept of gradient boosted trees that are used for machine learning problems. XGBoost has gained popularity in the last few years for its strong performance in machine learning competitions. The open-source CatBoost library, developed by Yandex, is fairly similar to XGBoost but has a few notable structural differences, such as built-in handling of categorical features and the use of symmetric trees. Microsoft has also released its own open-source framework, LightGBM.
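To make the training loop concrete, here is a minimal sketch of gradient boosting for regression under some simplifying assumptions: a single numeric feature with at least two distinct values, squared-error loss (so the negative gradient is just the residual), depth-1 trees, and a constant starting prediction instead of a full first tree. The helper names and the toy data are ours, and libraries such as XGBoost do considerably more than this.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold and one constant per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Build an additive ensemble of stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)                # start from a constant prediction
    predictions = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Take a small step in the direction that reduces the loss.
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data: y = x^2 on ten points.
xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]

mean_y = sum(ys) / len(ys)
baseline_error = mean_squared_error(lambda x: mean_y, xs, ys)
model = gradient_boost(xs, ys)
boosted_error = mean_squared_error(model, xs, ys)
print(boosted_error < baseline_error)  # prints: True
```

Each stump on its own is a crude model, but because every new stump is fitted to what the ensemble still gets wrong, the training error shrinks with each round. The learning rate plays the same role as the step size in gradient descent.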
When to Use Gradient Boosted Trees
As we discussed in our article on neural networks, datasets that have no obvious feature representation, such as images and audio files, are best suited for deep learning. Gradient boosted trees, on the other hand, are good for problems where the most significant features of the dataset are already known in advance.
For example, data in spreadsheet format might have columns that represent features such as yearly profits, quantities, and locations. These are categories of information that humans have already decided to be important. This process of choosing and constructing the most relevant features is known as feature engineering.
Suppose that a manufacturing company wants to optimize the price of a new item that it will release. The company can make use of its historical pricing data about other products: information such as price, location, quantity sold, and so on. This would be a strong use case for gradient boosted trees because the data is already in a structured, tabular format.
Gradient boosted trees often hit the “sweet spot” for machine learning problems: they require a little more tweaking from experts than random forests, but need less data than a neural network. In particular, most implementations have built-in mechanisms to limit overfitting, such as a learning rate that shrinks each tree’s contribution, limits on tree depth, and subsampling of rows or features. If you’ve already extracted the most important features from your data and you know what kind of answers you’re looking for from the learner, then gradient boosted trees may be just the right choice.