Special Topics
library(nixtlar)
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
1. Handling missing values
Before using `TimeGPT`, you need to ensure that:
- The target column contains no missing values (`NA`).
- Given the frequency of the data, the dates are continuous, with no missing dates between the start and the end dates.
Regarding the second point, it is worth mentioning that it is possible to have multiple time series that start and end on different dates, but each series must contain uninterrupted data for its given time frame.
There are several ways to check for missing values in R. One method is with the `any` and `is.na` functions from base R.
df <- nixtlar::electricity # load data
# create some missing values at random
index <- sample(nrow(df), 10)
df$y[index] <- NA
# check for missing values
any(is.na(df)) # will return TRUE if there are missing values
#> [1] TRUE
If you find missing values in your data, you need to decide how to fill them, which is very context-dependent. For example, if you are dealing with daily retail data, a missing value most likely indicates that there were no sales on that day, and you can probably fill it with zero. However, if you are working with hourly temperature data, a missing value likely means that the sensor was not functioning correctly, and you might prefer to use interpolation to fill the missing values. Whatever you decide to do, always keep in mind the nature of your data.
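The two filling strategies mentioned above can be sketched in base R. This is a minimal illustration on a made-up vector, not a recommendation for any particular dataset; `ifelse` and `approx` are base R functions, and the values are invented for the example.

```r
# Toy series with two gaps (values are made up for illustration)
y <- c(10, NA, 14, 16, NA, 20)

# Option A: treat a missing value as "nothing happened" (e.g. no sales)
y_zero <- ifelse(is.na(y), 0, y)

# Option B: linear interpolation between the neighbouring observations
idx <- seq_along(y)
y_interp <- approx(x = idx[!is.na(y)], y = y[!is.na(y)], xout = idx)$y

y_zero
y_interp
```

Note that `approx` with its default settings does not extrapolate, so leading or trailing missing values would remain `NA` and need separate handling.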
Checking if there are missing dates is more complicated since it depends on the frequency of the data. Sometimes plotting can help spot large gaps. `nixtlar` has a plotting function called `nixtla_client_plot` that can be used for this.
However, this method is ineffective when the missing dates are scattered rather than consecutive, because isolated gaps are hard to see in a plot. One possible solution is to compare the dates of every unique id with a vector of dates generated from the start date, the end date, and the frequency of your data. This requires knowing that information for each series, which can become tricky when working with hundreds or thousands of time series.
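The comparison described above could look like the following sketch in base R. It assumes daily data with columns named `unique_id` and `ds`; `find_missing_dates` is a hypothetical helper written for this example and is not part of `nixtlar`.

```r
# Toy daily data with two series; series "B" is missing 2024-01-03
df <- data.frame(
  unique_id = c("A", "A", "A", "B", "B", "B"),
  ds = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03",
                 "2024-01-01", "2024-01-02", "2024-01-04"))
)

# Hypothetical helper: dates expected for one series but absent from it
find_missing_dates <- function(dates, by = "day") {
  expected <- seq(min(dates), max(dates), by = by)
  expected[!expected %in% dates]
}

# Apply the check to every series separately
missing_by_id <- lapply(split(df$ds, df$unique_id), find_missing_dates)
missing_by_id
```

Each element of `missing_by_id` holds the dates that are absent from that series; an empty vector means the series has no gaps. For other frequencies, the `by` argument would need to match the data (for example `"week"` or `"month"`).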
2. Specifying the frequency of your data
The frequency parameter is crucial when working with time series data because it informs the model about the expected intervals between data points. The core functions of `nixtlar` that interface with `TimeGPT`, such as `nixtla_client_forecast`, `nixtla_client_historic`, `nixtla_client_detect_anomalies`, and `nixtla_client_cross_validation`, require you to specify the `freq` parameter, although in some cases `nixtlar` can deduce it from your data.
`TimeGPT` supports the following aliases:
Frequency | Alias |
---|---|
Yearly | Y or A |
Quarterly | Q |
Monthly | M |
Weekly (starting Sundays) | W |
Daily | D |
Hourly | H |
Minute-level | min |
Second-level | S |
Both minute-level and second-level frequencies can be preceded by an integer, such as “10min” or “30S”.
The default value of the frequency parameter is `NULL`. When this parameter is not specified, `nixtlar` will attempt to determine the frequency of your data. For most common frequencies, such as yearly, quarterly, monthly, weekly, daily, and hourly, `nixtlar` can identify the frequency from the `time_col` column, regardless of whether you are working with data frames or tsibbles.
df <- nixtlar::electricity
# `freq` is not specified, so nixtlar infers it from the data (here, "H")
fcst <- nixtlar::nixtla_client_forecast(df, h = 8, id_col = "unique_id", level = c(80, 95))
#> Frequency chosen: H
When the frequency is not apparent, `nixtlar` will default to a daily frequency. Currently, `nixtlar` cannot automatically detect subhourly frequencies, so you must specify them explicitly. For example, set `freq="15min"` for data recorded every fifteen minutes or `freq="S"` for data recorded every second. Only the aliases “min” and “S” are allowed for minute-level and second-level frequencies.
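Since subhourly frequencies have to be supplied by hand, it can help to derive the alias from the timestamps themselves. The sketch below uses only base R; `subhourly_alias` is a hypothetical helper written for this example (it is not part of `nixtlar`), and it assumes the most common gap between consecutive timestamps is the true step of the series.

```r
# Toy timestamps taken every 15 minutes (made up for illustration)
ds <- seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
          by = "15 min", length.out = 8)

# Hypothetical helper: build a "min"/"S" alias from the most common time step
subhourly_alias <- function(timestamps) {
  step <- as.numeric(diff(sort(timestamps)), units = "secs")
  step <- as.numeric(names(which.max(table(step))))  # most frequent step, in seconds
  if (step >= 3600) stop("step is hourly or coarser; use another alias")
  if (step %% 60 == 0) {
    n <- step / 60
    if (n == 1) "min" else paste0(n, "min")
  } else {
    if (step == 1) "S" else paste0(step, "S")
  }
}

subhourly_alias(ds)
```

The returned string can then be passed as `freq`. Using the most frequent step rather than the minimum keeps the result stable when a few timestamps are missing.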
Moreover, if you are dealing with irregular frequencies, such as business days or custom holiday calendars, you must specify them directly. For instance, for business days, you should set `freq="B"`, which corresponds to the pandas alias for business day frequency. Please refer to the pandas offset aliases for more information, and keep in mind that for minute-level or second-level data you can only use the aliases “min” and “S”.
When dealing with weekly frequency (`W`), `nixtlar` assumes that the weeks start on Sunday. Consequently, `TimeGPT` will return dates corresponding to weeks that begin on Sundays. If your weeks start on a different day, for example, Mondays, you should specify the frequency as `W-MON`. You can select any day of the week with the aliases `W-MON`, `W-TUE`, `W-WED`, `W-THU`, `W-FRI`, and `W-SAT`.
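To pick the right weekly alias, you can check which weekday the dates in your data fall on. The following base R sketch assumes every timestamp of a weekly series falls on the day its week starts; `weekly_alias` is a hypothetical helper written for this example and is not part of `nixtlar`.

```r
# Toy weekly dates that all fall on Mondays (made up for illustration)
ds <- seq(as.Date("2024-01-01"), by = "week", length.out = 4)

# Hypothetical helper: map the weekday of the dates to a weekly alias
weekly_alias <- function(dates) {
  wday <- unique(as.POSIXlt(dates)$wday)  # 0 = Sunday, ..., 6 = Saturday
  if (length(wday) != 1) stop("dates do not share a single weekday")
  if (wday == 0) {
    "W"  # Sunday is the default described above
  } else {
    paste0("W-", c("MON", "TUE", "WED", "THU", "FRI", "SAT")[wday])
  }
}

weekly_alias(ds)
```

The helper stops with an error when the dates fall on more than one weekday, which usually signals that the series is not truly weekly or that some dates are misaligned.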