Gradient Boosting

The workhorse of modern machine learning. After falling down the rabbit hole of understanding the construction of regression trees (and creating my own implementation), I am now ready to plug it into a gradient boosting machine.

Below are my notes on the algorithm, but first, some links to material that does a much better job of explaining it: here and here.

Pseudo-algorithm

  • Create an initial estimator (the target-variable mean) for the data
  • Based on the error of the current prediction, iteratively add a regression tree that tries to minimise the error of the previous prediction (formalised below)
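
In symbols (my notation; $h_m$ is the $m$-th tree and $\lambda$ the learning rate):

$$F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \lambda\, h_m(x), \quad m = 1, \dots, M$$

where each $h_m$ is fit to the residuals $y - F_{m-1}(x)$.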

Intuition

The additive forward expansion on the residuals is really an example of gradient descent, with each added term signifying another step taken.

Using squared loss, which is the most intuitive:

$$L(y, F(x)) = \frac{1}{2}\left(y - F(x)\right)^2 \qquad \Rightarrow \qquad -\frac{\partial L}{\partial F(x)} = y - F(x)$$

the residual each tree is fit to is exactly the negative gradient of the loss with respect to the current prediction. Thus, we can reformulate the update as a gradient descent step:

$$F_m(x) = F_{m-1}(x) - \lambda \frac{\partial L}{\partial F_{m-1}(x)} = F_{m-1}(x) + \lambda \left(y - F_{m-1}(x)\right)$$

Where $\lambda$ signifies the step size.

Using L1 (Absolute) Loss

Alternatives to the squared loss include the absolute loss and the Huber loss, both of which are more robust to outliers in the data.
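
For reference, the Huber loss blends the two, quadratic for residuals below a threshold $\delta$ and linear beyond it:

$$L_\delta(y, F(x)) = \begin{cases} \frac{1}{2}\left(y - F(x)\right)^2 & \text{if } |y - F(x)| \le \delta \\ \delta\left(|y - F(x)| - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}$$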

Since the absolute loss differentiates to the sign of the residual:

$$L(y, F(x)) = |y - F(x)| \qquad \Rightarrow \qquad -\frac{\partial L}{\partial F(x)} = \operatorname{sign}\left(y - F(x)\right)$$

the trees are fit to $\pm 1$ pseudo-residuals rather than to the raw residuals. Important to note is that the absolute loss is used when constructing the trees, but the predicted target value of a leaf will be the median of the examples in it (compared to the mean for squared loss).
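
A small numpy sketch of the difference, using toy numbers of my own:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 100.0])  # toy targets with one outlier
F = np.full_like(y, 2.0)              # current model predictions

# Pseudo-residuals the next tree is fit to:
r_l2 = y - F           # squared loss: the raw residuals
r_l1 = np.sign(y - F)  # absolute loss: only the sign, so the outlier has no extra pull

# Value predicted by a leaf containing all four examples:
leaf_l2 = np.mean(y - F)    # 24.5 -- dragged up by the outlier
leaf_l1 = np.median(y - F)  # 0.5  -- robust to it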

GBM for Classification

Train one model per class, predicting 0 or 1. I assume you can then add a softmax at the end to get something like class probabilities, where the sum of the probabilities equals 1.
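
A sketch of that last step; the scores below are made-up per-class outputs, not from a trained model:

import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical raw outputs: one row per example, one column per class model
scores = np.array([[0.9, 0.1, 0.2],
                   [0.2, 0.8, 0.3],
                   [0.1, 0.2, 0.7]])
probs = softmax(scores)
assert np.allclose(probs.sum(axis=1), 1.0)  # each row now sums to 1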

# Verify right venv is used
import platform
assert platform.python_version() == '3.7.0'

Regression target variable

So, let’s get to it. To keep it simple, we’ll predict a regression target variable and use L2 loss.

Since my custom tree model is not very optimised (profiling and tuning its performance might be a nice next challenge, including rewriting parts in Cython), the trees are a bit slow to build, so I limit the model to ten trees and set a relatively high learning rate.

from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

data = load_boston()
X = data.data
y = data.target

X_df = pd.DataFrame(X, columns=data.feature_names)
X_df.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
import sys
sys.path.insert(0, '/Users/freddie.karlbom/dev/algorithms')

# Importing my custom regression tree class, can be found at
# https://github.com/FreddieK/algorithms-in-python
from decisiontree.decisiontree import DecisionTree

y_pred = np.mean(y)  # initial model: predict the target mean for every example

learning_rate = 0.5
nr_trees = 10
print(DecisionTree.score(y, y_pred))  # MAE of the initial model

for i in range(nr_trees):
    dx = y - y_pred  # residuals, i.e. the negative gradient of the squared loss
    tree = DecisionTree()

    dx_df = pd.DataFrame(dx)
    dx_df.columns = ['y']

    tree.build_tree(X_df, dx_df)  # fit the next tree to the residuals
    y_pred += learning_rate * np.array(tree.predict(X_df))

    mae = DecisionTree.score(y, y_pred)
    print(mae)
6.647207423956008
4.64223543227224
3.5060203764688502
3.0035297382705246
2.7370645289421143
2.5532973608407707
2.3876131975762247
2.27056874997753
2.201764300656173
2.139500731765435
2.0867045115081146

Takeaways

  • Seeing the mean absolute error decrease consistently between iterations shows that the model keeps improving with each new tree added
  • When I decided to get more familiar with gradient boosting, I had no clue that I would end up most of the time reading up on decision trees
  • Note to self: when refactoring this code into a class for my repository, I’ll store all the trees in the model so that predictions can be made on a hold-out set post-training as well… (sketched below)
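
Something like this hypothetical helper could replay the additive expansion on unseen data (assuming the loop above also appended each fitted tree to a trees list; the names are my own, not part of the repository yet):

import numpy as np

def gbm_predict(X_new, trees, initial_prediction, learning_rate=0.5):
    # Hypothetical sketch: start from the stored training-target mean,
    # then replay each tree's correction, scaled by the learning rate
    y_pred = np.full(len(X_new), initial_prediction)
    for tree in trees:
        y_pred += learning_rate * np.array(tree.predict(X_new))
    return y_pred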
