
I am trying to use pytorch_forecasting TemporalFusionTransformer to predict oceanic environmental variables.

My data is organized as follows:

Location (sensor's locations):

  • loc_a
  • loc_b
  • loc_c

Ocean environment features/timeseries related to locations:

  • current
    • velocity
    • direction
  • sea_surface_height
    • ssh
  • astronomical_tide
    • at
  • simulated_current
    • simulated_velocity
    • simulated_direction
  • simulated_sea_surface_height
    • simulated_ssh

Prediction target:

  • Proof of concept: current velocity at loc_a
  • The dream: current and ssh at all locations

Some characteristics of the data:

  • time_varying_unknown_reals:
    • current
    • sea_surface_height
  • time_varying_known_reals:
    • astronomical_tide
    • simulated_current
    • simulated_sea_surface_height
  • Some locations don't have all related environment timeseries
  • All the timeseries have missing data with different sizes of gaps
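To give an idea of what "different sizes of gaps" means, the gaps can be measured per series roughly like this. This is only a minimal sketch: the `toy` frame, its values and the helper name `gap_sizes` are made up for illustration, and it assumes a long-format frame with an integer hourly index like hours_from_start.

```python
import pandas as pd

# Toy long-format frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "location": ["loc_a"] * 5 + ["loc_b"] * 3,
    "env_var": ["velocity"] * 5 + ["ssh"] * 3,
    "hours_from_start": [0, 1, 2, 5, 6, 0, 4, 5],
})


def gap_sizes(df, group_cols, time_col="hours_from_start"):
    """Return, per row, how many time steps are missing just before it."""
    ordered = df.sort_values(group_cols + [time_col])
    step = ordered.groupby(group_cols)[time_col].diff()
    # A step of 1 means no gap; the first row of each group has no predecessor.
    return (step - 1).fillna(0).astype(int)


# Assignment aligns on the original index, so row order in `toy` is preserved.
toy["missing_before"] = gap_sizes(toy, ["location", "env_var"])

# Largest gap (in hours) for each series.
largest_gap = toy.groupby(["location", "env_var"])["missing_before"].max()
print(largest_gap)
```

In this toy frame the loc_a velocity series jumps from hour 2 to hour 5, and the loc_b ssh series jumps from hour 0 to hour 4.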

Just for testing, I tried to build a model using only the time_varying_unknown features, stored in a single column labeled value, followed by the columns location, env_feature (current and sea_surface_height) and env_var (velocity, direction, ssh).

The TimeSeriesDataSet() configuration:

from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

# time_df, training_cutoff, max_encoder_length and max_prediction_length
# are defined earlier in my script.
training = TimeSeriesDataSet(
    time_df[lambda x: x.hours_from_start <= training_cutoff],
    time_idx="hours_from_start",
    target='value',
    # one series per (location, env_feature, env_var) combination
    group_ids=['location', 'env_feature', 'env_var'],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=['location', 'env_feature', 'env_var'],
    # calendar features only; no simulated or tide data yet
    time_varying_known_reals=["hours_from_start", "day",
                              "day_of_week", "month", 'hour'],
    time_varying_unknown_reals=['value'],
    target_normalizer=GroupNormalizer(groups=['location', 'env_feature', 'env_var']),
    #constant_fill_strategy={"value": 0.0},
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
    allow_missing_timesteps=True,
)

A sample of related df.head(5):

  location env_feature   env_var      value  hours_from_start  days_from_start                       date  hour  day  day_of_week  month
0  barnabe     current  velocity  0.0724509                 0                0  2020-01-01 00:00:00+00:00     0    1            2      2
1  barnabe     current  velocity   0.111463                 1                0  2020-01-01 01:00:00+00:00     1    1            2      2
2  barnabe     current  velocity  0.0814537                 2                0  2020-01-01 02:00:00+00:00     2    1            2      2
3  barnabe     current  velocity  0.0788815                 3                0  2020-01-01 03:00:00+00:00     3    1            2      2
4  barnabe     current  velocity   0.187344                 4                0  2020-01-01 04:00:00+00:00     4    1            2      2

Considering the example above, the dataset only contains the unknown data. If I add all the known data, plus a column with binary labels (known_value and unknown_value), I don't know how to pass this information to the TimeSeriesDataSet() function.

I'm trying to work out how to build a single dataframe that holds all this information without including timestamps that have missing values, and then how to map the df columns to the respective parameters of the TimeSeriesDataSet() function.
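One layout I am considering is a wide dataframe with one row per location and hour, where each env_var becomes its own column. The observed columns could then be listed under time_varying_unknown_reals and the tide and simulated columns under time_varying_known_reals. Below is a minimal pandas sketch of that reshaping. The frame name `long_df` and all values are invented for illustration, and I have not confirmed that this is the layout TimeSeriesDataSet() expects.

```python
import pandas as pd

# Toy long-format data in the same shape as my real frame (values invented).
long_df = pd.DataFrame({
    "location": ["loc_a"] * 6 + ["loc_b"] * 2,
    "hours_from_start": [0, 0, 0, 1, 1, 1, 0, 0],
    "env_var": ["velocity", "simulated_velocity", "at",
                "velocity", "simulated_velocity", "at",
                "ssh", "at"],
    "value": [0.07, 0.08, 0.5, 0.11, 0.10, 0.6, 1.2, 0.4],
})

# One row per (location, hour); one column per environmental variable.
wide = long_df.pivot_table(
    index=["location", "hours_from_start"],
    columns="env_var",
    values="value",
    aggfunc="first",
).reset_index()
wide.columns.name = None  # drop the leftover "env_var" axis label

# Split the variable columns by role, keeping only those present in the data.
unknown_reals = [c for c in ["velocity", "direction", "ssh"]
                 if c in wide.columns]
known_reals = [c for c in ["at", "simulated_velocity",
                           "simulated_direction", "simulated_ssh"]
               if c in wide.columns]

print(wide)
print("unknown:", unknown_reals)
print("known:", known_reals)
```

With this layout, cells stay NaN wherever a location lacks a variable (loc_b has no velocity here). As far as I understand, TimeSeriesDataSet() does not accept NaN values in real-valued columns, so those cells would still need to be filled or the affected rows dropped, which is the part I am unsure about.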

Has anyone had a similar experience that might help me resolve this issue?
