How to create a single dataframe using TimeSeriesDataSet() to feed a TemporalFusionTransformer model?

Question

I am trying to use pytorch_forecasting TemporalFusionTransformer to predict oceanic environmental variables.

My data is organized as follows:

Location (sensor's locations):

loc_a
loc_b
loc_c

Ocean environment features/timeseries related to locations:

current
- velocity
- direction
sea_surface_height
- ssh
astronomical_tide
- at
simulated_current
- simulated_velocity
- simulated_direction
simulated_sea_surface_height
- simulated_ssh

Prediction target:

Proof of concept: current velocity at loc_a
Long-term goal: current and ssh at all locations

Some characteristics of the data:

time_varying_unknown_reals:
- current
- sea_surface_height
time_varying_known_reals:
- astronomical_tide
- simulated_current
- simulated_sea_surface_height
Some locations don't have all related environment timeseries
All the timeseries have missing data with different sizes of gaps
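
To make the gap problem concrete, here is a minimal sketch with made-up numbers for a single location. As far as I understand the documentation, allow_missing_timesteps=True covers rows that are absent from the time index, not NaN values inside existing rows, so NaNs have to be filled or dropped beforehand. The column names velocity and at, the interpolation limit of 2, and the choice to interpolate the known covariate are all assumptions for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format sample: one row per (location, hour).
df = pd.DataFrame(
    {
        "location": ["loc_a"] * 5,
        "hours_from_start": [0, 1, 2, 3, 4],
        "velocity": [0.07, np.nan, 0.08, np.nan, 0.19],  # unknown real (target)
        "at": [0.5, 0.6, np.nan, 0.4, 0.3],               # known real
    }
)

# Assumption: short gaps in a known covariate can be interpolated per location.
df["at"] = df.groupby("location")["at"].transform(lambda s: s.interpolate(limit=2))

# Rows without a target value are dropped, which turns NaN values
# into gaps in the time index.
clean = df.dropna(subset=["velocity"]).reset_index(drop=True)
print(clean)
```

After this step the frame contains no NaN values, and the hours 1 and 3 are simply absent for loc_a.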

Just for testing, I tried to run a model with only the time_varying_unknown features, stored in a single column labeled value, followed by the columns location, env_feature (current and sea_surface_height) and env_var (velocity, direction, ssh).

The TimeSeriesDataSet() configuration:

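from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

# training_cutoff, max_encoder_length and max_prediction_length are defined earlier (not shown)
training = TimeSeriesDataSet(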
    time_df[lambda x: x.hours_from_start <= training_cutoff],
    time_idx="hours_from_start",
    target='value',
    group_ids=['location', 'env_feature', 'env_var'],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=['location', 'env_feature', 'env_var'],
    time_varying_known_reals=["hours_from_start", "day",
                              "day_of_week", "month", 'hour'],
    time_varying_unknown_reals=['value'],
    target_normalizer=GroupNormalizer(groups=['location', 'env_feature', 'env_var']),
    #constant_fill_strategy={"value": 0.0},
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
    allow_missing_timesteps=True,
)

A sample of the corresponding dataframe, df.head(5):

	location	env_feature	env_var	value	hours_from_start	date	hour	day	day_of_week	month
0	barnabe	current	velocity	0.0724509	0	2020-01-01 00:00:00+00:00	0	1	2	2
1	barnabe	current	velocity	0.111463	1	2020-01-01 01:00:00+00:00	1	1	2	2
2	barnabe	current	velocity	0.0814537	2	2020-01-01 02:00:00+00:00	2	1	2	2
3	barnabe	current	velocity	0.0788815	3	2020-01-01 03:00:00+00:00	3	1	2	2
4	barnabe	current	velocity	0.187344	4	2020-01-01 04:00:00+00:00	4	1	2	2
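
For reference, the time columns in a sample like this could be derived from date roughly as follows. This is a sketch on five hypothetical timestamps, not the exact code used for the table. Note that pandas' dt.month returns 1 for January, whereas the sample shows 2, so the month column above may have been computed differently.

```python
import pandas as pd

# Five hypothetical hourly timestamps matching the sample's date column.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-01-01 00:00:00+00:00",
                "2020-01-01 01:00:00+00:00",
                "2020-01-01 02:00:00+00:00",
                "2020-01-01 03:00:00+00:00",
                "2020-01-01 04:00:00+00:00",
            ],
            utc=True,
        )
    }
)

# Integer time index: whole hours elapsed since the first timestamp.
elapsed = df["date"] - df["date"].min()
df["hours_from_start"] = (elapsed / pd.Timedelta(hours=1)).astype(int)

# Calendar features that are known in advance for any future timestamp.
df["hour"] = df["date"].dt.hour
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek  # Monday = 0
df["month"] = df["date"].dt.month            # January = 1
print(df)
```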

Considering the example above, the dataset only contains the unknown data. If I add all the known data plus a column with the binary labels known_value and unknown_value, I don't know how to pass this information to TimeSeriesDataSet().

I'm trying to work out how to build a single dataframe that holds all of this information without including timestamps with missing values, and then how to map the dataframe columns to the corresponding parameters of TimeSeriesDataSet().
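
One layout I am considering, sketched here with pandas on made-up values: pivot the long table so that each (location, hour) pair is one row and each variable is its own column. The known/unknown split would then be expressed through column names rather than a binary label column. The data, the reduced set of variables (velocity, at) and the parameter mapping in the trailing comment are assumptions for illustration only. I have not confirmed that this is the intended way to use the library.

```python
import pandas as pd

# Hypothetical long-format data: one row per (location, env_var, hour).
long_df = pd.DataFrame(
    {
        "location": ["loc_a"] * 6 + ["loc_b"] * 4,
        "env_var": ["velocity", "velocity", "velocity", "at", "at", "at",
                    "velocity", "velocity", "at", "at"],
        "hours_from_start": [0, 1, 3, 0, 1, 3, 0, 1, 0, 1],
        "value": [0.07, 0.11, 0.08, 0.5, 0.6, 0.4, 0.20, 0.25, 0.3, 0.2],
    }
)

# Wide format: one row per (location, hour), one column per variable.
wide_df = (
    long_df.pivot_table(
        index=["location", "hours_from_start"],
        columns="env_var",
        values="value",
    )
    .reset_index()
    .rename_axis(columns=None)
)
print(wide_df)

# Assumed mapping to TimeSeriesDataSet parameters for the proof of concept:
#   group_ids=["location"], target="velocity",
#   time_varying_unknown_reals=["velocity"],
#   time_varying_known_reals=["at", "hours_from_start"]
unknown_reals = ["velocity"]
known_reals = ["at", "hours_from_start"]
```

With this layout, a location that lacks one of the variables would end up with NaN in that column, which would still need to be handled before building the dataset.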

Has anyone had a similar experience that might help me resolve this issue?
