The expression “You can’t see the forest for the trees” is usually used metaphorically to describe someone who focuses too much on minor details and not enough on the big picture. Similarly, in the case of machine learning, it’s better to focus less on the individual decision trees and more on the entire random forest.
In a previous article, we discussed the ins and outs of decision trees: how they work, their pros and cons, and the situations for which they’re best suited. Unfortunately, a single decision tree is prone to a few major issues, such as overfitting and high variance: small changes in the training data can produce a very different tree.
We’ll now discuss a closely related technique that aims to reduce the unpredictability of a single decision tree. Random forests are an ensemble method in machine learning that combines the wisdom of many different decision trees. By choosing the majority opinion from among all the decision trees in their collection, random forests can improve their performance and accuracy.
How Are Random Forests Used?
Like the individual decision trees that they’re composed of, random forests can be used for both classification and regression tasks. The output of a random forest for a classification task is the label that is most frequently chosen, or “voted on,” by all the trees in the forest.
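To make the voting step concrete, here is a minimal sketch in Python. The function name `majority_vote` and the animal labels are made up for illustration, and the tie-breaking behavior (first label seen wins) is just what Python's `Counter` happens to do, not a rule from any particular library.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label that the most trees predicted."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-item list holding the (label, count)
    # pair with the highest count; ties go to the label seen first.
    label, _count = counts.most_common(1)[0]
    return label

# Hypothetical votes from five trees for a single animal.
votes = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(votes))  # cat
```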
Random forests can also be used for regression tasks in which the goal is to estimate a continuous value. In fact, random forests are often better at regression than individual decision trees because the forest’s output is the average of the predictions made by every tree. A single tree can only output a limited set of values, one per leaf, so its predictions are jagged and stepwise; averaging over a large collection of trees smooths those steps out.
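The smoothing effect is easy to see in a toy sketch. Here each "tree" is a hypothetical one-split stump that jumps from 0 to 1 at its own threshold; the helper names and thresholds are invented for illustration and are not taken from any library.

```python
from statistics import mean

def make_stump(threshold, low, high):
    """Build a one-split regression tree: `low` below the threshold, `high` otherwise."""
    def predict(x):
        return low if x < threshold else high
    return predict

# A hypothetical forest of four stumps, each splitting at a different point.
forest = [make_stump(t, 0.0, 1.0) for t in (1.0, 2.0, 3.0, 4.0)]

def forest_predict(forest, x):
    """Average the predictions of every tree in the forest."""
    return mean([tree(x) for tree in forest])

# Each stump alone can only answer 0.0 or 1.0; the average takes
# intermediate values as x moves past each threshold.
print(forest_predict(forest, 0.5))  # 0.0
print(forest_predict(forest, 2.5))  # 0.5
print(forest_predict(forest, 4.5))  # 1.0
```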
How Do Random Forests Work?
As the name suggests, random forests are collections of decision trees. In particular, each decision tree in the random forest is trained on a random sample of the training data, drawn with replacement, so some data points appear more than once in a tree’s sample and others not at all. This concept is known as “bagging” (short for bootstrap aggregating) and is very popular for its ability to reduce variance and overfitting.
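Sampling with replacement takes only a couple of lines. The sketch below is an illustration under simple assumptions: the rows are stand-in integers, the helper name `bootstrap_sample` is made up, and the sample is the same size as the original dataset, which is a common choice rather than a requirement.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training examples

sample = bootstrap_sample(rows, rng)
left_out = sorted(set(rows) - set(sample))

print(sample)    # same length as rows, usually with repeats
print(left_out)  # rows this particular tree would never see
```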
Random forests build on bagging, but they add one crucial point of distinction. In addition to being trained on only a sample of the data, the decision trees within a random forest are restricted to a random subset of the features. (In the most common formulation, a fresh random subset of features is drawn every time a tree chooses a split.) This technique aims to make the decision trees within the random forest more diverse, as well as lessen the impact of features that are irrelevant.
For example, suppose that you wanted to build a random forest that can classify different animals based on their physical features. One of the decision trees might look only at the animal’s ears and color, while another might look at whether it has a tail or fins. In addition to reducing unpredictability, this can make the forest more tolerant of data points that are missing certain features, since not every tree depends on every feature.
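Picking a random subset of features might look like the sketch below. The feature names are hypothetical, and the default subset size (roughly the square root of the number of features) is a widely used heuristic for classification that we assume here, not something fixed by the method.

```python
import math
import random

def random_feature_subset(feature_names, rng, k=None):
    """Pick k distinct features; by default about sqrt(number of features)."""
    if k is None:
        k = max(1, round(math.sqrt(len(feature_names))))
    # rng.sample draws without replacement, so no feature repeats.
    return rng.sample(feature_names, k)

# Hypothetical physical features for the animal classifier.
features = ["ears", "color", "tail", "fins", "legs",
            "fur", "size", "beak", "wings"]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
for tree_index in range(3):
    print(tree_index, random_feature_subset(features, rng))
```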
Although each individual decision tree does not look at the entire dataset, random forests build up a “critical mass” of decision trees that should be able to come to the right conclusion by combining their collective expertise.
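A quick calculation shows why a “critical mass” helps. The sketch below assumes a two-class problem in which every tree is right with the same probability and the trees make their mistakes independently of one another. Real trees trained on overlapping data are correlated, so treat the numbers as an optimistic illustration rather than a guarantee.

```python
from math import comb

def majority_correct_probability(n_trees, p_correct):
    """Probability that more than half of n_trees independent trees are right."""
    needed = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct)**(n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd forest sizes avoid ties. Each tree is assumed to be right 60% of the time.
for n_trees in (1, 11, 101):
    print(n_trees, round(majority_correct_probability(n_trees, 0.6), 3))
```

Under these assumptions, the forest’s accuracy climbs as trees are added, even though no individual tree gets any better.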
When to Use Random Forests
Random forests usually work fairly well out of the box, often without much feature engineering or hyperparameter tuning. They can handle datasets that include both numerical and categorical features. However, random forests are harder to visualize and understand than a single decision tree, and they can still overfit on particularly noisy datasets.
Random forests are also very well-suited for parallelization and efficient processing of large datasets. Each of the decision trees is trained independently of the others, with its own sample of data points and features. You can therefore train many trees at the same time in parallel, which can substantially speed up the training process.
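Because no tree needs anything from any other tree, training maps naturally onto a worker pool. The sketch below is a toy under stated assumptions: each "tree" is just the mean of its own bootstrap sample, and `train_tree` and `train_forest` are hypothetical names. It uses threads so that it runs anywhere; for real CPU-bound training in CPython you would more likely swap in `ProcessPoolExecutor`, which offers the same `map` interface.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def train_tree(job):
    """Toy stand-in for training one tree on its own bootstrap sample."""
    seed, targets = job
    rng = random.Random(seed)  # each tree gets its own random stream
    sample = rng.choices(targets, k=len(targets))
    return mean(sample)        # the "model" is just a constant prediction

def train_forest(targets, n_trees, max_workers=4):
    """Train every tree independently using a pool of workers."""
    jobs = [(seed, targets) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in job order, whichever worker finishes first.
        return list(pool.map(train_tree, jobs))

targets = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical regression targets
forest = train_forest(targets, n_trees=8)
print(len(forest), mean(forest))
```

Libraries typically hide this behind a single setting; scikit-learn’s random forest estimators, for instance, accept an `n_jobs` parameter.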
Another advantage of random forests is that they can automatically perform the task of feature selection: deciding which features of the data are most relevant in answering a given question. For example, suppose that you run an e-commerce store that sells shoes, and you notice that your return rates have spiked in recent months. However, you collect dozens of different data points for each order—from the type of shoe to the time of day the order was made—which makes it extremely difficult to understand the cause of these returns.
By feeding this data into a random forest, you can train each individual decision tree on a subset of the features, and then have the forest predict whether or not a given order will be returned. The trees that are most accurate at predicting returns are those that rely on the most relevant features, which point to the likely reasons why orders were returned in the first place. Based on these findings, you can then work to improve your return rates, whether it’s through better email campaigns or more accurate product images.
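One common way to put a number on “relevance” is permutation importance: shuffle a single feature’s values across the orders and see how much accuracy the model loses. This is a different technique from comparing the accuracy of individual trees, and the sketch below rests on invented pieces: the two order features, the synthetic data, and the `predict` rule standing in for a trained forest are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, column, rng):
    """Accuracy lost when one column's values are shuffled across rows."""
    baseline = accuracy(predict, rows, labels)
    values = [r[column] for r in rows]
    rng.shuffle(values)
    shuffled_rows = [
        r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, values)
    ]
    return baseline - accuracy(predict, shuffled_rows, labels)

# Hypothetical orders: (wrong_size_flag, hour_of_day). In this made-up data,
# an order is returned exactly when the size was wrong.
rows = [(i % 2, i % 24) for i in range(100)]
labels = [r[0] for r in rows]

def predict(row):
    """Stand-in for a trained forest that has learned the size rule."""
    return row[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for column, name in enumerate(["wrong_size_flag", "hour_of_day"]):
    print(name, permutation_importance(predict, rows, labels, column, rng))
```

Shuffling the hour of day costs this model nothing, because it never looks at that column, while shuffling the size flag hurts accuracy. A large drop suggests the model leans on that feature; it does not by itself prove the feature causes returns.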
Here at Very, we successfully used random forests for a “fake news” detection project that analyzed news articles to determine their veracity. The algorithm collected information about the article and website, including features such as the word count and reading level. For example, if the article was over a certain number of words, then it was more likely to be a real article and not “fake news.”