"We claim to be a cutting-edge AI company to our customers. But they have no idea what our algorithm is. We provide great results, and they are happy. But at the end of the day, most of our models are simple- either regression or XGBoost."
A few years ago, I heard the above statement from a friend who was the head of AI at a highly funded startup. At that time, I knew regression very well but wasn’t entirely sure about XGBoost, though I had come across it in various contexts.
Following this conversation, I kept hearing from industry experts about how extensively XGBoost is used. So, I decided to take a deep dive. It was a fascinating journey that I will share someday.
However, this post is not about XGBoost. It is about random forests, a method that preceded XGBoost.
Let us start with a bit of history: Condorcet's jury theorem
Marquis de Condorcet, a French philosopher and mathematician, proposed a fascinating theorem in 1785: if each voter is independently more than 50% likely to be correct, then adding more voters increases the probability that the majority decision is correct.
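To see the effect numerically, here is a minimal sketch (assuming independent voters, each correct with probability 0.6; the function name and the 0.6 figure are my own illustration) that computes the probability of a correct majority from the binomial distribution:

```python
from scipy.stats import binom

def majority_correct_prob(p, n):
    """Probability that a strict majority of n independent voters,
    each correct with probability p, reaches the right answer."""
    k_needed = n // 2 + 1                     # votes needed for a strict majority
    return 1 - binom.cdf(k_needed - 1, n, p)  # P(correct votes >= k_needed)

for n in [1, 11, 101, 1001]:
    print(f"{n:5d} voters -> {majority_correct_prob(0.6, n):.4f}")
```

With p above 0.5 the probability climbs toward 1 as the jury grows; with p below 0.5 the same arithmetic works against you.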
This idea directly relates to random forest. But first, a quick recap on decision trees - the basic building blocks of random forest.
Imagine you have a dataset with the height and weight of three animals - cow, dog, and pig. You build a decision tree classifier to predict the animal type. With the right decision tree, you can achieve 100% accuracy.
The problem, however, is that decision trees are highly sensitive to the dataset. Slight changes in the data can lead to a completely different tree structure. Even perturbing just 5% of the training dataset with Gaussian noise can drastically alter the decision tree.
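As a rough illustration (the height/weight numbers below are made up, not a real dataset), here is a sketch using scikit-learn: fit a tree, nudge a small fraction of the rows with Gaussian noise, refit, and compare the printed structures. The split thresholds, and sometimes the structure itself, can change.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical (height cm, weight kg) samples for cow, dog, pig.
X = np.array([[150, 700], [145, 650], [155, 720],   # cows
              [50, 20],   [55, 25],   [45, 18],     # dogs
              [90, 200],  [95, 220],  [85, 180]])   # pigs
y = np.array(["cow"] * 3 + ["dog"] * 3 + ["pig"] * 3)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["height", "weight"]))

# Perturb roughly 5% of the rows (at least one) with Gaussian noise and refit.
X_noisy = X.astype(float).copy()
idx = rng.choice(len(X), size=max(1, len(X) // 20), replace=False)
X_noisy[idx] += rng.normal(0, 10, size=(len(idx), 2))
tree_noisy = DecisionTreeClassifier(random_state=0).fit(X_noisy, y)
print(export_text(tree_noisy, feature_names=["height", "weight"]))
```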
The solution? Ensemble models
Decision trees are interpretable, but they lack robustness. This is where random forest comes in.
Random forest is an ensemble learning method where multiple decision trees work together to make predictions. The idea is simple: if the individual models are each more than 50% accurate and their errors are somewhat independent, their collective decision will be better than any single model, just as Condorcet’s jury theorem suggests.
How random forest works
1️⃣ Draw a bootstrap sample (a random subset of the data, sampled with replacement) for each tree.
2️⃣ Train each tree using only a random subset of features at each split.
3️⃣ Each tree makes a prediction, and the majority vote determines the final output.
The result? A more robust and accurate model.
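Here is a minimal from-scratch sketch of those three steps, built on top of scikit-learn decision trees. The class name and defaults are my own illustration, not a standard API:

```python
from collections import Counter
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """A toy random forest: bootstrap samples, per-split feature
    subsampling, and majority voting over the trees."""

    def __init__(self, n_trees=100, max_depth=None, random_state=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        n_samples = len(X)
        self.trees = []
        for _ in range(self.n_trees):
            # 1) Bootstrap sample: draw rows with replacement.
            idx = self.rng.integers(0, n_samples, size=n_samples)
            # 2) Feature subsampling is delegated to the tree via
            #    max_features="sqrt" (a random subset at each split).
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features="sqrt",
                random_state=int(self.rng.integers(1_000_000)),
            )
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # 3) Majority vote across the trees, one column per sample.
        votes = np.array([t.predict(X) for t in self.trees])
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

In practice you would simply use sklearn.ensemble.RandomForestClassifier, which implements the same idea (plus out-of-bag scoring and more) far more efficiently.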
I have just released a detailed video on random forest that covers the theory, a walkthrough of an excellent interactive article from MLU-explain, and an implementation in code.
You will see that in the example dataset we use, individual decision trees reach up to 92% accuracy, while a random forest built from trees of the same depth reaches up to 97%.
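If you want to run that kind of comparison yourself, a sketch like the following captures the setup (it uses a synthetic stand-in dataset, not the one from the video, so the exact numbers will differ): a single depth-limited tree versus a forest of trees with the same depth.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; swap in your own data here.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

depth = 5
tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, max_depth=depth,
                                random_state=42).fit(X_train, y_train)

print("single tree   :", tree.score(X_test, y_test))
print("random forest :", forest.score(X_test, y_test))
```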