Forecasting with Machine Learning models
mlforecast makes forecasting with machine learning fast & easy
We at Nixtla are trying to make time series forecasting more accessible to everyone. In this post I'll talk about using machine learning models in forecasting tasks. I'll use an example to show what the main challenges are, and then I'll introduce mlforecast, a framework that facilitates using machine learning models in forecasting. mlforecast does the feature engineering and takes care of the updates for you: the user only has to provide a regressor that follows the scikit-learn API (implements fit and predict) and specify the features she wants to use. These features can be lags, lag-based transformations and date features. (For further feature creation or an automated forecasting pipeline, check fasttsfeatures and autotimeseries.)
For many years classical methods like ARIMA and ETS dominated the forecasting field. One of the reasons was that most of the use cases involved forecasting low-frequency series with monthly, quarterly or yearly granularity. Furthermore, datasets rarely contained many series, so fitting a single model to each one and getting forecasts from it was straightforward.
However, in recent years the need to forecast bigger datasets at higher frequencies has risen. Bigger and higher-frequency time series pose a challenge for classical forecasting methods: they aren't meant to model many time series together, so using them is suboptimal and slow (you have to train one model per series), and they can't exploit the common or shared patterns between series that could be learned by modeling them together.
To address this problem, various methods have been proposed that can train a single model on many time series. Some fascinating deep learning architectures have been designed that can accurately forecast many series, like ESRNN, DeepAR and NBEATS, among others. (Check nixtlats and Replicating ESRNN results for our work in progress.)
Traditional machine learning models like gradient boosted trees have been used as well and have shown that they can achieve very good performance. However, using these models with lag-based features isn't straightforward, because you have to update your features at every timestep in order to compute the predictions. Additionally, depending on your forecasting horizon and the lags that you use, at some point you run out of real values of your series to update your features with, so you have to do something to fill those gaps. One possible approach is to use your predictions as if they were the values of the series and update your features with them. This is exactly what mlforecast does for you.
In the following section I'll show a very simple example with a single series to highlight the difficulties of using machine learning models in forecasting tasks. This will later motivate the use of mlforecast, a library that makes the whole process easier and faster.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
rng = np.random.RandomState(90)
serie_length = 7 * 20
dates = pd.date_range('2000-01-01', freq='D', periods=serie_length, name='ds')
y = dates.dayofweek + rng.randint(-1, 2, size=dates.size)
data = pd.DataFrame({'y': y.astype(np.float64)}, index=dates)
data.plot(marker='.', figsize=(16, 6));
Our data has weekly seasonality (the pattern repeats every 7 days) and, as you can see from the way it was created, it is basically just dayofweek + Uniform({-1, 0, 1}).
Let's say we want forecasts for the next 14 days. The first step is deciding which model and features to use, so we'll create a validation set containing the last 14 days of our data.
valid_horizon = 14
train = data.head(-valid_horizon).copy()
y_valid = data.tail(valid_horizon)['y']
Now we'll try to find which lags are the most important to use as features. To do this we'll compute the autocorrelation of the series values with respect to each lag.
max_lags = 28
autocorrelations = np.empty((max_lags, 2))
autocorrelations[:, 0] = np.arange(max_lags) + 1
for lag in range(max_lags):
    autocorrelations[lag, 1] = np.corrcoef(y[lag + 1:], y[:-(lag + 1)])[0, 1]
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(autocorrelations[:, 0], autocorrelations[:, 1])
ax.set(
    xlabel='lag',
    ylabel='Correlation coefficient',
    title='Autocorrelation by lag',
    xticks=[lag + 1 for lag in range(0, max_lags, 2)],
);
We can see that the most important lags are multiples of 7. As a starting point we'll try lag 7 and lag 14.
train['lag-7'] = train['y'].shift(7)
train['lag-14'] = train['y'].shift(14)
Computing lag values leaves some rows with nulls.
train.isnull().sum()
We'll drop these before training.
train_without_nulls = train.dropna()
X_train = train_without_nulls.drop(columns='y')
y_train = train_without_nulls['y']
For simplicity's sake, we'll train a linear regression without an intercept. Since both lags fall on the same day of the week as the target, the best model is essentially their average, so we expect coefficients close to 0.5.
lr = LinearRegression(fit_intercept=False).fit(X_train, y_train)
lr.coef_
This model is taking $0.51 \cdot lag_7 + 0.45 \cdot lag_{14}$.
Great, we have our trained model. How can we compute the forecasts for the next 14 days? Machine learning models take a feature matrix X and output the predicted values y, so we need to create the feature matrix X for the next 14 days and give it to our model.
If we want the lag-7 for the first day after the training set, we can just take the value in the 7th position counting from the end of the training data. The lag-7 for the second day after the end of the training set is the value in the 6th position from the end, and so on. The same applies to the lag-14.
next_lags_7 = y_train.tail(7).values
next_lags_7
next_lags_14 = y_train.tail(14).values
next_lags_14
As you may have noticed, we can only get 7 of the lag-7 values from our history, whereas we can get all 14 values of the lag-14. With this information alone we can only forecast the next 7 days, so for now we'll take just the first 7 values of the lag-14.
X_valid1 = pd.DataFrame({
    'lag-7': next_lags_7,
    'lag-14': next_lags_14[:7],
})
X_valid1
With these features we can compute the forecasts for the next 7 days.
forecasts_7 = lr.predict(X_valid1)
forecasts_7
These values can be interpreted as the values of our series for the next 7 days following the last training date. In order to compute the forecasts for the 7 days after that, we can use these predictions as if they were the actual values of our series and take them as the lag-7 for those periods.
In other words, we can fill the rest of our feature matrix with these values and the real values of the lag-14.
X_valid2 = pd.DataFrame({
    'lag-7': forecasts_7,
    'lag-14': next_lags_14[-7:],
})
X_valid2
As you can see we're still using the real values of the lag-14 and we've plugged in our predictions as the values for the lag-7. We can now use these features to predict the remaining 7 days.
forecasts_7_14 = lr.predict(X_valid2)
y_pred = np.hstack([forecasts_7, forecasts_7_14])
y_pred
And now we have our forecasts for the next 14 days! This wasn't that painful, but it wasn't pretty or easy either, and we only used lags, which are the easiest features we can have.
What if we had used lag-1? We would have needed to do this predict-update step 14 times!
And what if we had more elaborate features like the rolling mean over some lag? As you can imagine it can get quite messy and is very error prone.
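To make the lag-1 case concrete, here is a rough sketch of that predict-update loop (model_lag1 is an illustrative name for a regressor that was fitted on a single lag-1 column; this is just the idea, not code we use below):

# Hedged sketch: recursive forecasting with a single lag-1 feature.
history = list(y_train.values)
recursive_preds = []
for _ in range(valid_horizon):
    # the only feature is the last known (or previously predicted) value
    X_next = pd.DataFrame({'lag-1': [history[-1]]})
    next_pred = model_lag1.predict(X_next)[0]
    recursive_preds.append(next_pred)
    # plug the prediction back in so the next step can use it as its lag-1
    history.append(next_pred)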
With these problems in mind we created mlforecast, which is a framework to help you forecast time series using machine learning models. It takes care of all these messy details for you. You just need to give it a model and define which features you want to use and let mlforecast do the rest.
mlforecast is available on PyPI (pip install mlforecast) as well as on conda-forge (conda install -c conda-forge mlforecast).
The previously described problem can be solved using mlforecast with the following code.
First we have to set up our data in the required format.
train_mlfcst = train.reset_index()[['ds', 'y']]
train_mlfcst.index = pd.Index(np.repeat(0, train.shape[0]), name='unique_id')
train_mlfcst.head()
This is the required input format.
- an index named unique_id that identifies each time series. In this case we only have one, but you can have as many as you want (a sketch with two series follows this list).
- a ds column with the dates.
- a y column with the values.
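For illustration only, here is a hedged sketch of how a frame with two series might be built (the second series is just a shifted copy of the first, and the ids 0 and 1 are arbitrary):

# two made-up series stacked in the same frame, told apart by unique_id
two_series = pd.concat([
    pd.DataFrame({'ds': dates, 'y': y}).assign(unique_id=0),
    pd.DataFrame({'ds': dates, 'y': y + 10}).assign(unique_id=1),
]).set_index('unique_id')
two_series.head()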
Now we'll import the TimeSeries transformer, which is where we define the features that we want to use. We'll also import the Forecast class, which will hold our transformer and model and will run the forecasting pipeline for us.
from mlforecast.core import TimeSeries
from mlforecast.forecast import Forecast
We initialize our transformer specifying the lags that we want to use.
ts = TimeSeries(lags=[7, 14])
ts
As you can see this transformer will use lag-7 and lag-14 as features. Now we define our model.
model = LinearRegression(fit_intercept=False)
We create a Forecast object with the model and the time series transformer and fit it to our data.
fcst = Forecast(model, ts)
fcst.fit(train_mlfcst)
And now we just call predict with the forecast horizon that we want.
y_pred_mlfcst = fcst.predict(14)
y_pred_mlfcst
This was a lot easier, and internally it did the same thing we did before. Let's quickly verify that.
Check that we got the same predictions:
np.testing.assert_equal(y_pred, y_pred_mlfcst['y_pred'].values)
Check that we got the same model:
np.testing.assert_equal(lr.coef_, fcst.model.coef_)
Having this high level abstraction allows us to focus on defining the best features and model instead of worrying about implementation details. For example, we can try out different lags very easily by writing a simple function that leverages mlforecast:
def evaluate_lags(lags):
    ts = TimeSeries(lags=lags)
    model = LinearRegression(fit_intercept=False)
    fcst = Forecast(model, ts)
    fcst.fit(train_mlfcst)
    print(*[f'lag-{lag:<2} coef: {fcst.model.coef_[i]:.2f}' for i, lag in enumerate(lags)], sep='\n')
    y_pred = fcst.predict(14)
    mse = mean_squared_error(y_valid, y_pred['y_pred'])
    print(f'MSE: {mse:.2f}')
evaluate_lags([7, 14])
evaluate_lags([7, 14, 21])
evaluate_lags([7, 14, 21, 28])
Backtesting
In the previous examples we manually split our data. The Forecast object also has a backtest method that can do that for us.
We'll first get all of our data into the required format.
data_mlfcst = data.reset_index()
data_mlfcst.index = pd.Index(np.repeat(0, data_mlfcst.shape[0]), name='unique_id')
data_mlfcst
Now we instantiate a Forecast object as we did previously and call the backtest method instead.
backtest_fcst = Forecast(
    LinearRegression(fit_intercept=False), TimeSeries(lags=[7, 14])
)
backtest_results = backtest_fcst.backtest(data_mlfcst, n_windows=2, window_size=14)
This returns a generator with the results for each window.
type(backtest_results)
result1 = next(backtest_results)
result1
result2 = next(backtest_results)
result2
result2 here is the same as the evaluation we did manually.
np.testing.assert_equal(result2['y_pred'].values, y_pred_mlfcst['y_pred'].values)
We can define a validation scheme for different lags using several windows.
def backtest(ts, model=None):
    if model is None:
        model = LinearRegression(fit_intercept=False)
    fcst = Forecast(model, ts)
    backtest_results = fcst.backtest(data_mlfcst, n_windows=4, window_size=14)
    mses = []
    results = []
    for i, result in enumerate(backtest_results):
        mses.append(mean_squared_error(result['y'], result['y_pred']))
        results.append(result.rename(columns={'y_pred': f'split-{i}'}))
    pd.concat(results).set_index('ds').plot(marker='.', figsize=(16, 6))
    print('Splits MSE:', np.round(mses, 2))
    print(f'Mean MSE: {np.mean(mses):.2f}')
backtest(TimeSeries(lags=[7, 14]))
backtest(TimeSeries(lags=[7, 14, 21]))
backtest(TimeSeries(lags=[7, 14, 21, 28]))
Lag transformations
We can specify transformations on the lags as well as plain lag values. The window_ops library has implementations of several window functions, and you can also define your own transformations (we'll sketch one after the following example).
Let's try a seasonal rolling mean. This takes the average over the last n seasons; in this case that would be the average of the last n Mondays, Tuesdays, and so on. Computing the updates for this feature by hand would probably be a bit annoying, but with this framework we can just pass it to lag_transforms. If the transformation takes additional arguments (besides the values of the series) we specify a tuple like (transform_function, arg1, arg2), which in this case are season_length and window_size.
from window_ops.rolling import seasonal_rolling_mean
help(seasonal_rolling_mean)
lag_transforms takes a dictionary where the keys are the lags that we want to apply the transformations to and the values are the transformations themselves.
ts = TimeSeries(
    lag_transforms={
        7: [(seasonal_rolling_mean, 7, 8)]
    }
)
backtest(ts)
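As mentioned above, you can also define your own transformations. Here is a hedged sketch of what one might look like; the function name is made up, and we follow the window_ops style of numba-compiled, array-in/array-out functions (check the documentation for the exact requirements):

from numba import njit

@njit
def diff_from_previous(x):
    # Illustrative custom transform: difference between each value and the
    # previous one, with NaN where there is no previous value.
    out = np.full(x.shape[0], np.nan)
    for i in range(1, x.shape[0]):
        out[i] = x[i] - x[i - 1]
    return out

# A function without extra arguments is passed directly instead of as a tuple.
# It could then be evaluated with backtest(ts_custom) just like the built-in transform.
ts_custom = TimeSeries(lag_transforms={7: [diff_from_previous]})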
Date features
You can also specify date features to be computed. These are attributes of the ds column and are updated at each timestep as well. In this example the best model is taking the average over each day of the week, which can be accomplished by one-hot encoding the day of the week and fitting a linear model.
ts = TimeSeries(date_features=['dayofweek'])
model = make_pipeline(
    OneHotEncoder(drop='first'), LinearRegression(fit_intercept=False)
)
backtest(ts, model)
In this post we presented mlforecast, a framework that makes the use of machine learning models in forecasting tasks fast and easy. It allows you to focus on the model and features instead of implementation details. With mlforecast you can run experiments more easily, and its built-in backtesting functionality helps you find the best performing model.
Although this example contained only a single time series, mlforecast can handle thousands of them and is very efficient in terms of both time and memory.
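As a rough sketch of what that looks like (the number of series and the way they're simulated here are made up for illustration), the same API handles a frame with many series without any changes:

# Hedged sketch: fit the same pipeline on 1,000 simulated series at once.
n_series = 1_000
many_series = pd.concat([
    pd.DataFrame(
        {
            'ds': pd.date_range('2000-01-01', freq='D', periods=serie_length),
            'y': np.arange(serie_length) % 7 + rng.rand(serie_length),
        },
        index=pd.Index(np.repeat(i, serie_length), name='unique_id'),
    )
    for i in range(n_series)
])
many_fcst = Forecast(LinearRegression(fit_intercept=False), TimeSeries(lags=[7, 14]))
many_fcst.fit(many_series)
many_preds = many_fcst.predict(14)  # forecasts for every series in one call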
Next steps
mlforecast has more features like distributed training and a CLI. If you're interested you can learn more in the following resources:
- GitHub repo: https://github.com/Nixtla/mlforecast
- Documentation: https://nixtla.github.io/mlforecast/
- Example using mlforecast in the M5 competition: https://www.kaggle.com/lemuz90/m5-mlforecast