When to Use Random Forests
Random forests usually work fairly well out of the box, without the need to perform feature engineering, balance the dataset, or tweak the model’s hyperparameters. They cope well with datasets that mix numerical and categorical features. However, random forests are harder to visualize and interpret than a single decision tree, and they can still overfit on particularly noisy datasets.
Random forests are also well suited to parallelization, which makes them efficient on large datasets. Each decision tree is built independently of the others, from its own sample of data points and features. Because no tree depends on another, you can train many trees at the same time, which can greatly speed up the training process.
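As a minimal sketch of parallel training, the snippet below uses scikit-learn, which is an assumption on our part since the article does not name a library. The dataset is synthetic, and the sizes and variable names are chosen purely for illustration. Setting n_jobs=-1 asks scikit-learn to spread tree construction across all available CPU cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset (illustrative sizes only).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# n_jobs=-1 lets scikit-learn build the independent trees on all CPU cores.
parallel_forest = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=0,
)
parallel_forest.fit(X, y)

# Each fitted tree is stored separately in the ensemble.
print(len(parallel_forest.estimators_), "trees trained")
```

How much time this saves depends on the number of cores, the number of trees, and the size of the data, so it is worth timing n_jobs=1 against n_jobs=-1 on your own workload.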
Another advantage of random forests is that they can automatically perform the task of feature selection: deciding which features of the data are most relevant in answering a given question. For example, suppose that you run an e-commerce store that sells shoes, and you notice that your return rates have spiked in recent months. However, you collect dozens of different data points for each order—from the type of shoe to the time of day the order was made—which makes it extremely difficult to understand the cause of these returns.
By feeding this data into a random forest, you can train each individual decision tree on a subset of the features, and then have the forest predict whether or not a given order will be returned. The features that contribute most to accurate predictions across the trees are the ones most strongly associated with returns, which points you toward the likely reasons why orders were returned in the first place. Based on these findings, you can then work to reduce your return rates, whether it’s through better email campaigns or more accurate product images.
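The shoe-store scenario can be sketched roughly as follows, again assuming scikit-learn and NumPy. Everything about the data is hypothetical: the feature names, the synthetic orders, and the built-in rule that orders placed without a size chart are returned more often. The sketch ranks features with the forest's impurity-based feature_importances_ attribute, which is one common way to measure relevance, not the only one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_orders = 2000

# Hypothetical order features; the values are synthetic, not real store data.
feature_names = ["shoe_type", "order_hour", "discount_pct", "size_chart_missing"]
X = np.column_stack([
    rng.integers(0, 5, n_orders),    # shoe_type: integer-encoded category
    rng.integers(0, 24, n_orders),   # order_hour: hour of day, 0-23
    rng.uniform(0, 50, n_orders),    # discount_pct: discount applied
    rng.integers(0, 2, n_orders),    # size_chart_missing: 1 if no size chart shown
])

# Assumption baked into the toy data: a missing size chart drives returns.
return_prob = np.where(X[:, 3] == 1, 0.8, 0.05)
y = (rng.random(n_orders) < return_prob).astype(int)  # 1 = order returned

# max_features="sqrt" means each split considers a random subset of features.
# min_samples_leaf keeps trees from splitting on pure noise.
forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    min_samples_leaf=50,
    random_state=0,
)
forest.fit(X, y)

# Rank features by their impurity-based importance (scores sum to 1).
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

On real data, importance scores indicate association rather than proven cause, and impurity-based scores can favor features with many distinct values. Checking the ranking against scikit-learn's permutation_importance on held-out data is a reasonable sanity check.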
Here at Very, we successfully used random forests for a “fake news” detection project that analyzed news articles to determine their veracity. The algorithm collected information about the article and website, including features such as the word count and reading level. For example, if the article was over a certain number of words, then it was more likely to be a real article and not “fake news.”
The future is all about data: having it and being able to draw conclusions from it. That means companies that successfully use predictive analytics and machine learning to inform business decisions have a competitive edge. If you want that edge, we can help. Get in touch.