I am trying to use the `pytorch_forecasting` `TemporalFusionTransformer` to predict oceanic environmental variables.
My data is organized as follows:
Locations (sensor locations):
- loc_a
- loc_b
- loc_c
Ocean environment features/timeseries related to locations:
- current
  - velocity
  - direction
- sea_surface_height
  - ssh
- astronomical_tide
  - at
- simulated_current
  - simulated_velocity
  - simulated_direction
- simulated_sea_surface_height
  - simulated_ssh
Prediction target:
- Proof of concept: current velocity at loc_a
- The dream: current and ssh at all locations
Some characteristics of the data:
- time_varying_unknown_reals:
  - current
  - sea_surface_height
- time_varying_known_reals:
  - astronomical_tide
  - simulated_current
  - simulated_sea_surface_height
- Some locations don't have all of the related environment time series
- All the time series have missing data, with gaps of different sizes
Just for testing, I tried to run a model with only the time-varying unknown features, represented by a single column labeled `value`, followed by the columns `location`, `env_feature` (`current` and `sea_surface_height`) and `env_var` (`velocity`, `direction`, `ssh`).
The `TimeSeriesDataSet()` configuration:

    from pytorch_forecasting import TimeSeriesDataSet
    from pytorch_forecasting.data import GroupNormalizer

    training = TimeSeriesDataSet(
        time_df[lambda x: x.hours_from_start <= training_cutoff],
        time_idx="hours_from_start",
        target="value",
        group_ids=["location", "env_feature", "env_var"],
        min_encoder_length=max_encoder_length // 2,
        max_encoder_length=max_encoder_length,
        min_prediction_length=1,
        max_prediction_length=max_prediction_length,
        static_categoricals=["location", "env_feature", "env_var"],
        time_varying_known_reals=["hours_from_start", "day",
                                  "day_of_week", "month", "hour"],
        time_varying_unknown_reals=["value"],
        target_normalizer=GroupNormalizer(groups=["location", "env_feature", "env_var"]),
        # constant_fill_strategy={"value": 0.0},
        add_relative_time_idx=True,
        add_target_scales=True,
        add_encoder_length=True,
        allow_missing_timesteps=True,
    )

A sample of the related dataframe, from `df.head(5)`:
| | location | env_feature | env_var | value | hours_from_start | days_from_start | date | hour | day | day_of_week | month |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | barnabe | current | velocity | 0.0724509 | 0 | 0 | 2020-01-01 00:00:00+00:00 | 0 | 1 | 2 | 2 |
| 1 | barnabe | current | velocity | 0.111463 | 1 | 0 | 2020-01-01 01:00:00+00:00 | 1 | 1 | 2 | 2 |
| 2 | barnabe | current | velocity | 0.0814537 | 2 | 0 | 2020-01-01 02:00:00+00:00 | 2 | 1 | 2 | 2 |
| 3 | barnabe | current | velocity | 0.0788815 | 3 | 0 | 2020-01-01 03:00:00+00:00 | 3 | 1 | 2 | 2 |
| 4 | barnabe | current | velocity | 0.187344 | 4 | 0 | 2020-01-01 04:00:00+00:00 | 4 | 1 | 2 | 2 |
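
For reference, calendar columns like the ones in this sample could be derived from the `date` column roughly as follows. This is only a sketch with made-up hourly timestamps; the column names are taken from the sample, but how the original columns were actually computed is an assumption.

```python
import pandas as pd

# Hypothetical hourly, timezone-aware timestamps standing in for the `date` column.
start = pd.Timestamp("2020-01-01 00:00:00", tz="UTC")
df = pd.DataFrame({"date": [start + pd.Timedelta(hours=h) for h in range(5)]})

# Integer time index: whole hours (and days) elapsed since the first timestamp.
elapsed = df["date"] - df["date"].min()
df["hours_from_start"] = (elapsed // pd.Timedelta(hours=1)).astype(int)
df["days_from_start"] = (elapsed // pd.Timedelta(days=1)).astype(int)

# Calendar features read directly from the timestamp.
df["hour"] = df["date"].dt.hour
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek  # Monday = 0
df["month"] = df["date"].dt.month            # January = 1

print(df)
```

With this derivation `month` is 1 for January dates, whereas the sample shows 2, so the original column was presumably built differently.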
Considering the example above, the dataset only accepts the unknown data. If I add all the known data together with a column of binary labels, `known_value` and `unknown_value`, I don't know how to pass this information to `TimeSeriesDataSet()`.
I'm trying to work out how to build a single dataframe that holds all of this information while leaving out timestamps with missing values, and then how to assign the df columns to the respective parameters of `TimeSeriesDataSet()`.
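
To make the question concrete, one possible layout is a wide dataframe with one column per variable, so that each column can be listed under `time_varying_known_reals` or `time_varying_unknown_reals`. The sketch below uses made-up values for a single location. The gap handling (interpolating the known series, dropping rows where the target is missing) is an assumption, not a confirmed recipe for `TemporalFusionTransformer`.

```python
import pandas as pd

# Hypothetical long-format data: one row per (location, env_var, time step).
# `velocity` has a gap at hour 2, `at` (astronomical tide) has a gap at hour 3.
long_df = pd.DataFrame({
    "location": ["loc_a"] * 6,
    "env_var": ["velocity", "velocity", "velocity", "at", "at", "at"],
    "hours_from_start": [0, 1, 3, 0, 1, 2],
    "value": [0.07, 0.11, 0.08, 0.50, 0.60, 0.70],
})

# Pivot to wide format: one column per variable, NaN where a series has a gap.
wide_df = (
    long_df
    .pivot_table(index=["location", "hours_from_start"],
                 columns="env_var", values="value")
    .reset_index()
)
wide_df.columns.name = None

known_reals = ["at"]          # known in advance (tide predictions)
unknown_reals = ["velocity"]  # measured, only known up to the present

# Assumption: gaps in known covariates may be filled by interpolation per location.
wide_df[known_reals] = wide_df.groupby("location")[known_reals].transform(
    lambda s: s.interpolate(limit_direction="both")
)

# Assumption: rows with a missing target are dropped, and the resulting holes
# in the time index are left to allow_missing_timesteps=True.
model_df = wide_df.dropna(subset=unknown_reals).reset_index(drop=True)
print(model_df)

# The columns would then map onto TimeSeriesDataSet parameters along these lines:
#   time_idx="hours_from_start", target="velocity", group_ids=["location"],
#   time_varying_known_reals=known_reals, time_varying_unknown_reals=unknown_reals,
#   allow_missing_timesteps=True
```

In this layout the known/unknown distinction lives in the parameter lists rather than in a label column, which may remove the need for the `known_value` / `unknown_value` flags.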
Has anyone had a similar experience that might help me resolve this issue?