How to help tree-based models extrapolate?

Shanshan Guo
3 min read · Dec 23, 2020

Tree-based models such as decision trees, random forests and gradient-boosted trees are popular in machine learning because they deliver high accuracy and are fast to train. They are also insensitive to outliers, which contributes to their robustness in many use cases. However, one well-known limitation of tree-based models is that they cannot extrapolate beyond the range of the training data. In regression, for example, a decision tree's prediction is the average target value of the training samples that fall into a leaf, so it can never exceed the largest target seen during training.

Now let’s play with some real data to visualize the problem. The chart below shows the weekly Google Trends index for ‘machine learning’ and ‘deep learning’ from 2014–11–23 to 2017–12–17:

Say we want to use the Google trend index for machine learning (denoted as x) to predict the future index for deep learning (denoted as y). Let’s first create the features:
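The exact feature set lives in the linked repo; one plausible construction is lagged values of x as predictors for y, sketched below (the `make_features` helper and the `n_lags=4` choice are assumptions for illustration):

```python
import pandas as pd

def make_features(x: pd.Series, y: pd.Series, n_lags: int = 4) -> pd.DataFrame:
    """Build lagged machine-learning-index features to predict the
    deep-learning index (a hypothetical feature design; see the
    linked repo for the exact one used in the article)."""
    data = pd.DataFrame({"y": y})
    for lag in range(1, n_lags + 1):
        # value of x observed `lag` weeks earlier
        data[f"x_lag{lag}"] = x.shift(lag)
    # the first n_lags rows have no complete history, so drop them
    return data.dropna()
```

Each row then pairs the current deep-learning index with the machine-learning index from the previous few weeks.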

Before proceeding to the modeling step, we have to decide on a cross-validation scheme. Since this is time-series data, let’s use a time-series split so that future data cannot leak into the training set:

Source: sklearn documentation
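In scikit-learn this is `TimeSeriesSplit`: each successive fold trains on a growing prefix of the data and tests on the block that immediately follows it. A quick illustration with placeholder data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # placeholder features, already in time order
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # every training index precedes every test index, so no future leakage
    print(f"fold {fold}: train size {len(train_idx)}, test size {len(test_idx)}")
```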

We’ll be using 3-fold cross-validation for this problem and root-mean-square-error as the evaluation metric.

Modeling

Now we have the data formulated for supervised learning. Let’s see how the tree-based models performed under cross-validation:
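A minimal sketch of the cross-validation loop, using synthetic trending data in place of the real features (xgboost's `XGBRegressor` plugs into the same fit/predict loop; only sklearn models are shown here). The assertion inside the loop demonstrates the plateau directly: no prediction can exceed the largest target in the training fold:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 160  # roughly the number of weeks in the trend data
X = np.arange(n, dtype=float).reshape(-1, 1)   # upward-trending feature
y = 0.5 * X.ravel() + rng.normal(0.0, 2.0, n)  # upward-trending target

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
tscv = TimeSeriesSplit(n_splits=3)
results = {}
for name, model in models.items():
    rmses = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        # the plateau: leaf averages cannot exceed the largest training target
        assert pred.max() <= y[train_idx].max()
        rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)
    results[name] = np.mean(rmses)
    print(f"{name}: mean RMSE = {results[name]:.2f}")
```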

See the plateaus in the predictions? All the tree-based models are unable to predict values larger than those in the training set.

How can we help the tree-based models extrapolate? We can ‘difference’ the time series (compute the differences between consecutive observations) to make it ‘stationary’:
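Differencing is a one-liner with pandas; a tiny example on made-up values:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 19.0])
# each value becomes the change since the previous week;
# the first observation has no predecessor, so drop it
diffed = s.diff().dropna()
print(diffed.tolist())  # [2.0, 3.0, 4.0]
```

The model is then trained to predict week-over-week *changes* rather than absolute index values, and changes of any sign and size appear throughout the training data.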

Now let’s see how the models performed on the differenced data:

We get a much lower RMSE. But remember, this is the RMSE on the differenced data; we still have to transform the predictions back to the original scale by computing a cumulative sum starting from the last observed value:
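The inverse transform is a cumulative sum anchored at the last observed value, sketched here with made-up numbers:

```python
import pandas as pd

last_observed = 19.0                     # last actual value before the forecast window
pred_diffs = pd.Series([3.0, 2.0, 5.0])  # model predictions on the differenced scale
# undo the differencing: running total of predicted changes on top of the anchor
restored = last_observed + pred_diffs.cumsum()
print(restored.tolist())  # [22.0, 24.0, 29.0]
```

Because each restored value adds the predicted change onto the previous level, the series can now climb past anything seen in training.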

See how random forest and xgboost were able to extrapolate! From here we can proceed to hyperparameter tuning to further improve model performance.

The code can be found here:

https://github.com/shanminlin/tree_models_extrapolate
