Gradient Boosting

The workhorse of modern machine learning. After falling down the rabbit hole of understanding the construction of regression trees (and creating my own implementation), I am now ready to plug it into a gradient boosting machine.

Below are my notes on the algorithm, but first, some links to material that does a much better job of explaining it: here and here.

Pseudo-algorithm

  • Create an initial estimator (the target-variable mean) for the data
  • Based on the error of the current prediction, iteratively add a regression tree that tries to minimise the error of the previous prediction (formalised below)
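
In symbols (my notation; $h_m$ is the $m$-th tree and $\lambda$ the learning rate):

$$F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \lambda\, h_m(x), \quad m = 1, \dots, M$$

where each $h_m$ is fit to the residuals $y - F_{m-1}(x)$.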

Intuition

The additive forward expansion on the residuals is really an example of gradient descent, with each added term signifying another step taken.

Using squared loss, which is the most intuitive:

$$L(y, F(x)) = \frac{1}{2}\left(y - F(x)\right)^2 \qquad \Rightarrow \qquad -\frac{\partial L}{\partial F(x)} = y - F(x)$$

the residual each tree is fit to is exactly the negative gradient of the loss with respect to the current prediction. Thus, we can reformulate the update as a gradient descent step:

$$F_m(x) = F_{m-1}(x) - \lambda \frac{\partial L}{\partial F_{m-1}(x)} = F_{m-1}(x) + \lambda \left(y - F_{m-1}(x)\right)$$

Where $\lambda$ signifies the step size.

Using L1 (Absolute) Loss

Alternatives to the squared loss include the absolute loss and the Huber loss, both of which are more robust to outliers in the data.
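
For reference, the Huber loss blends the two, quadratic for residuals below a threshold $\delta$ and linear beyond it:

$$L_\delta(y, F(x)) = \begin{cases} \frac{1}{2}\left(y - F(x)\right)^2 & \text{if } |y - F(x)| \le \delta \\ \delta\left(|y - F(x)| - \frac{\delta}{2}\right) & \text{otherwise} \end{cases}$$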

Since the absolute loss differentiates to the sign of the residual:

$$L(y, F(x)) = |y - F(x)| \qquad \Rightarrow \qquad -\frac{\partial L}{\partial F(x)} = \operatorname{sign}\left(y - F(x)\right)$$

the trees are fit to $\pm 1$ pseudo-residuals rather than to the raw residuals. Important to note is that the absolute loss is used when constructing the trees, but the predicted target value of a leaf will be the median of the examples in it (compared to the mean for squared loss).
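
A small numpy sketch of the difference, using toy numbers of my own:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 100.0])  # toy targets with one outlier
F = np.full_like(y, 2.0)              # current model predictions

# Pseudo-residuals the next tree is fit to:
r_l2 = y - F           # squared loss: the raw residuals
r_l1 = np.sign(y - F)  # absolute loss: only the sign, so the outlier has no extra pull

# Value predicted by a leaf containing all four examples:
leaf_l2 = np.mean(y - F)    # 24.5 -- dragged up by the outlier
leaf_l1 = np.median(y - F)  # 0.5  -- robust to it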

GBM for Classification

Train one model per class, predicting 0 or 1. I assume you can then add a softmax at the end to get something like class probabilities, where the sum of the probabilities equals 1.
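
A sketch of that last step; the scores below are made-up per-class outputs, not from a trained model:

import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical raw outputs: one row per example, one column per class model
scores = np.array([[0.9, 0.1, 0.2],
                   [0.2, 0.8, 0.3],
                   [0.1, 0.2, 0.7]])
probs = softmax(scores)
assert np.allclose(probs.sum(axis=1), 1.0)  # each row now sums to 1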

# Verify right venv is used
import platform
assert platform.python_version() == '3.7.0'

Regression target variable

So, let’s get to it. To keep it simple, we’ll predict a regression target variable and use L2 loss.

Since my custom tree model is not very optimised (profiling and tuning its performance might be a nice next challenge, including rewriting parts in Cython), the trees are a bit slow to build, so I limit the model to ten trees and set a relatively high learning rate.

from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

data = load_boston()
X = data.data
y = data.target

X_df = pd.DataFrame(X, columns=data.feature_names)
X_df.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
import sys
sys.path.insert(0, '/Users/freddie.karlbom/dev/algorithms')

# Importing my custom regression tree class, can be found at
# https://github.com/FreddieK/algorithms-in-python
from decisiontree.decisiontree import DecisionTree

y_pred = np.mean(y)  # initial model: predict the target mean for every example

learning_rate = 0.5
nr_trees = 10
print(DecisionTree.score(y, y_pred))  # MAE of the initial model

for i in range(nr_trees):
    dx = y - y_pred  # residuals, i.e. the negative gradient of the squared loss
    tree = DecisionTree()

    dx_df = pd.DataFrame(dx)
    dx_df.columns = ['y']

    tree.build_tree(X_df, dx_df)  # fit the next tree to the residuals
    y_pred += learning_rate * np.array(tree.predict(X_df))

    mae = DecisionTree.score(y, y_pred)
    print(mae)
6.647207423956008
4.64223543227224
3.5060203764688502
3.0035297382705246
2.7370645289421143
2.5532973608407707
2.3876131975762247
2.27056874997753
2.201764300656173
2.139500731765435
2.0867045115081146

Takeaways

  • Seeing the mean absolute error decrease consistently between iterations shows that the model keeps improving with each new tree added
  • When I decided to get more familiar with gradient boosting, I had no clue that I would end up most of the time reading up on decision trees
  • Note to self: when refactoring this code into a class for my repository, I’ll store all the trees in the model so that predictions can be made on a hold-out set post-training as well… (sketched below)
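
Something like this hypothetical helper could replay the additive expansion on unseen data (assuming the loop above also appended each fitted tree to a trees list; the names are my own, not part of the repository yet):

import numpy as np

def gbm_predict(X_new, trees, initial_prediction, learning_rate=0.5):
    # Hypothetical sketch: start from the stored training-target mean,
    # then replay each tree's correction, scaled by the learning rate
    y_pred = np.full(len(X_new), initial_prediction)
    for tree in trees:
        y_pred += learning_rate * np.array(tree.predict(X_new))
    return y_pred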
