Data Requirements
library(nixtlar)
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
This vignette explains the data requirements for using any of the core functions of nixtlar:
# Core functions of `nixtlar`
- nixtlar::nixtla_client_forecast()
- nixtlar::nixtla_client_historic()
- nixtlar::nixtla_client_detect_anomalies()
- nixtlar::nixtla_client_cross_validation()
- nixtlar::nixtla_client_plot()
1. Input requirements
nixtlar supports input data in the form of data frames and tsibbles. When working with the latter, nixtlar will do some transformations in the background to comply with TimeGPT's data requirements, but the output will always be a tsibble with no additional action required on your part.
Whether you use a data frame or a tsibble, when using any of the core functions of nixtlar, you must include the following two columns:
- Date column: This column should contain timestamps, which may be formatted as character strings (yyyy-mm-dd or yyyy-mm-dd HH:MM:SS), integers, or date-time objects. The character string format is preferred, although nixtlar is compatible with the latter two types. The default name for this column is ds. If your dataset uses a different name, please specify it by setting the parameter time_col="your_time_column_name".
- Target column: This column should contain the target variable to forecast, and it must be numeric. The default name for this column is y. If your data uses a different name, please specify it by setting the parameter target_col="your target column".
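As a minimal sketch of the two required columns, the base R snippet below builds a small data frame with non-default column names and checks it against the rules above. The helper check_required_columns() and the column names timestamp and sales are hypothetical and not part of nixtlar; only the parameter names time_col and target_col come from the package.

```r
# Hypothetical helper (not part of nixtlar): checks that a data frame has a
# usable date column and a numeric target column.
check_required_columns <- function(df, time_col = "ds", target_col = "y") {
  missing_cols <- setdiff(c(time_col, target_col), names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Accept character strings, numeric (covers integer timestamps), or date-time objects
  time_ok <- is.character(df[[time_col]]) ||
    is.numeric(df[[time_col]]) ||
    inherits(df[[time_col]], c("Date", "POSIXct"))
  if (!time_ok) {
    stop("The date column must be a character string, an integer, or a date-time object.")
  }
  if (!is.numeric(df[[target_col]])) {
    stop("The target column must be numeric.")
  }
  invisible(TRUE)
}

# Toy data with non-default column names
sales_df <- data.frame(
  timestamp = c("2024-01-01", "2024-01-02", "2024-01-03"),
  sales = c(10.5, 11.2, 9.8),
  stringsAsFactors = FALSE
)

# Passes silently because both columns exist and have acceptable types
check_required_columns(sales_df, time_col = "timestamp", target_col = "sales")
```

With an input like this, the same names would be passed to a core function through time_col="timestamp" and target_col="sales".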
2. Multiple series
If you are working with multiple series, you must include a column with a unique identifier for each of them. This column should contain character strings or integers. Unlike the Python SDK, which defaults to unique_id, nixtlar does not have a default name for this identifier column. This is one of the few differences between nixtlar and the Python SDK. You must specify the name of this identifier column if it exists in your dataset. Set this by using id_col="your_unique_id_column".
# sample valid input
df <- nixtlar::electricity
head(df)
#> unique_id ds y
#> 1 BE 2016-10-22 00:00:00 70.00
#> 2 BE 2016-10-22 01:00:00 37.10
#> 3 BE 2016-10-22 02:00:00 37.10
#> 4 BE 2016-10-22 03:00:00 44.75
#> 5 BE 2016-10-22 04:00:00 37.10
#> 6 BE 2016-10-22 05:00:00 35.61
str(df)
#> 'data.frame': 8400 obs. of 3 variables:
#> $ unique_id: chr "BE" "BE" "BE" "BE" ...
#> $ ds : chr "2016-10-22 00:00:00" "2016-10-22 01:00:00" "2016-10-22 02:00:00" "2016-10-22 03:00:00" ...
#> $ y : num 70 37.1 37.1 44.8 37.1 ...
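Before calling a core function, it can help to inspect the identifier column that will be passed as id_col. The sketch below uses a toy data frame instead of nixtlar::electricity so that it runs without the package; the values for the second series are made up for illustration.

```r
# Toy data frame with two series, mimicking the layout of the sample above
df_multi <- data.frame(
  unique_id = rep(c("BE", "DE"), each = 3),
  ds = rep(c("2016-10-22 00:00:00", "2016-10-22 01:00:00", "2016-10-22 02:00:00"), times = 2),
  y = c(70.00, 37.10, 37.10, 26.50, 25.10, 24.30),  # "DE" values are invented
  stringsAsFactors = FALSE
)

id_col <- "unique_id"

# The identifier column should contain character strings or integers
stopifnot(is.character(df_multi[[id_col]]) || is.integer(df_multi[[id_col]]))

# Number of observations per series
obs_per_series <- table(df_multi[[id_col]])
print(obs_per_series)
```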
3. Exogenous variables
When using exogenous variables, you need to:
- Add the historical exogenous variables to your input data, and
- Create a dataset with the future values of said exogenous variables, ensuring it spans the entire forecast horizon. This dataset should include a column with the appropriate timestamps and, if available, the unique identifiers, in the formats explained in the previous sections.
Please note that all columns after the target column will be considered exogenous variables. Hence, if you have additional columns in your dataset that are not exogenous variables, you must remove them before using any of the core functions of nixtlar.
# sample valid input with exogenous variables
df <- nixtlar::electricity_exo_vars
head(df)
#> unique_id ds y Exogenous1 Exogenous2 day_0 day_1 day_2
#> 1 BE 2016-10-22 00:00:00 70.00 49593 57253 0 0 0
#> 2 BE 2016-10-22 01:00:00 37.10 46073 51887 0 0 0
#> 3 BE 2016-10-22 02:00:00 37.10 44927 51896 0 0 0
#> 4 BE 2016-10-22 03:00:00 44.75 44483 48428 0 0 0
#> 5 BE 2016-10-22 04:00:00 37.10 44338 46721 0 0 0
#> 6 BE 2016-10-22 05:00:00 35.61 44504 46303 0 0 0
#> day_3 day_4 day_5 day_6
#> 1 0 0 1 0
#> 2 0 0 1 0
#> 3 0 0 1 0
#> 4 0 0 1 0
#> 5 0 0 1 0
#> 6 0 0 1 0
future_exo_vars <- nixtlar::electricity_future_exo_vars
head(future_exo_vars)
#> unique_id ds Exogenous1 Exogenous2 day_0 day_1 day_2 day_3
#> 1 BE 2016-12-31 00:00:00 64108 70318 0 0 0 0
#> 2 BE 2016-12-31 01:00:00 62492 67898 0 0 0 0
#> 3 BE 2016-12-31 02:00:00 61571 68379 0 0 0 0
#> 4 BE 2016-12-31 03:00:00 60381 64972 0 0 0 0
#> 5 BE 2016-12-31 04:00:00 60298 62900 0 0 0 0
#> 6 BE 2016-12-31 05:00:00 60339 62364 0 0 0 0
#> day_4 day_5 day_6
#> 1 0 1 0
#> 2 0 1 0
#> 3 0 1 0
#> 4 0 1 0
#> 5 0 1 0
#> 6 0 1 0
To learn more about how to use exogenous variables, please refer to the Exogenous variables vignette.
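The two requirements for exogenous variables can be checked with a few lines of base R. This is a sketch on toy data, assuming a horizon of h = 2 and a single exogenous column; the object names hist_df and future_df are illustrative only.

```r
# Toy historical data: columns after the target (y) are treated as exogenous
hist_df <- data.frame(
  unique_id = "BE",
  ds = c("2016-10-22 00:00:00", "2016-10-22 01:00:00", "2016-10-22 02:00:00"),
  y = c(70.00, 37.10, 37.10),
  Exogenous1 = c(49593, 46073, 44927),
  stringsAsFactors = FALSE
)

# Toy future values of the exogenous variable for a horizon of h = 2
future_df <- data.frame(
  unique_id = "BE",
  ds = c("2016-10-22 03:00:00", "2016-10-22 04:00:00"),
  Exogenous1 = c(44483, 44338),
  stringsAsFactors = FALSE
)

h <- 2

# Exogenous columns are everything after the target column
target_pos <- match("y", names(hist_df))
exo_cols <- names(hist_df)[seq_along(names(hist_df)) > target_pos]

# The future dataset should contain the same exogenous columns ...
stopifnot(all(exo_cols %in% names(future_df)))

# ... and cover the whole horizon for every series
rows_per_series <- table(future_df$unique_id)
stopifnot(all(rows_per_series == h))
```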
4. Missing values
When using TimeGPT via nixtlar, you need to ensure that:
- No Missing Values in Target Column: The target column must not contain any missing values (NA).
- Continuous Date Sequence: The dates must be continuous and without any gaps, from the start date to the end date, matching the frequency of the data.
Currently, nixtlar does not provide any functionality to fill missing values or dates. To learn more about this, please refer to the vignette on Special Topics.
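Both conditions can be checked in base R before sending data to TimeGPT. The sketch below assumes an hourly series with the default column names ds and y, and uses toy data with one missing target value and one missing timestamp.

```r
# Toy hourly series with one missing target value and one missing timestamp
df_gaps <- data.frame(
  ds = c("2024-01-01 00:00:00", "2024-01-01 01:00:00",
         "2024-01-01 03:00:00", "2024-01-01 04:00:00"),
  y = c(10, NA, 12, 13),
  stringsAsFactors = FALSE
)

# 1. Count missing values in the target column
n_missing_target <- sum(is.na(df_gaps$y))

# 2. Find gaps by comparing against a complete hourly grid
timestamps <- as.POSIXct(df_gaps$ds, tz = "UTC")
full_grid <- seq(from = min(timestamps), to = max(timestamps), by = "hour")
full_grid_chr <- format(full_grid, "%Y-%m-%d %H:%M:%S", tz = "UTC")
missing_timestamps <- setdiff(full_grid_chr, df_gaps$ds)

n_missing_target    # 1
missing_timestamps  # "2024-01-01 02:00:00"
```

For other frequencies, the by argument of seq() would need to match the frequency of the data.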
5. Minimum data requirements
The minimum size per series to obtain results from nixtlar::nixtla_client_forecast is one, regardless of the frequency of the data. Keep in mind, however, that this will produce results with limited accuracy.
For certain scenarios, more than one observation may be necessary:
- When using the parameters level, quantiles, or finetune_steps.
- When incorporating exogenous variables.
- When including historical forecasts by setting add_history=TRUE.
The minimum data requirement varies with the frequency of the data, detailed in the official TimeGPT documentation.
When using nixtlar::nixtla_client_cross_validation, you also need to consider the forecast horizon (h), the number of windows (n_windows), and the step size (step_size). The formula for the minimum data points required per series is:
Min per series in cross-validation = Min per series in forecasting + h + step_size * (n_windows - 1)
Here, "Min per series in forecasting" refers to the values specified in the table from the official documentation.
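The cross-validation requirement can be sketched as a small helper, assuming the minimum per series equals the minimum per series for forecasting plus h plus step_size * (n_windows - 1). The function min_obs_cross_validation() is hypothetical and not part of nixtlar, and the value used for min_forecast is a placeholder that should be replaced with the figure from the official table for your frequency.

```r
# Hypothetical helper (not part of nixtlar). min_forecast is the minimum per
# series for forecasting, looked up in the table from the official documentation.
min_obs_cross_validation <- function(min_forecast, h, n_windows, step_size) {
  min_forecast + h + step_size * (n_windows - 1)
}

# Illustrative values only; min_forecast = 48 is a placeholder, not a documented value
min_obs_cross_validation(min_forecast = 48, h = 24, n_windows = 3, step_size = 24)
# 48 + 24 + 24 * (3 - 1) = 120
```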