If I want to apply deep learning to the dataset from the sensors I currently have, I will need quite a lot of data, or the model is likely to overfit. Unfortunately, the sensors have only been active for a month, so the data needs augmentation. The data is currently in a dataframe like the one below:

index   timestamp              cas_pre        fl_rat         ...
0       2017-04-06 11:25:00    687.982849     1627.040283    ...
1       2017-04-06 11:30:00    693.427673     1506.217285    ...
2       2017-04-06 11:35:00    692.686310     1537.114807    ...
....
101003  2017-04-06 11:35:00    692.686310     1537.114807    ...

Now I want to augment some particular columns with the tsaug package. The augmentation can be in the form of:

my_aug = (    
    RandomMagnify(max_zoom=1.2, min_zoom=0.8) * 2
    + RandomTimeWarp() * 2
    + RandomJitter(strength=0.1) @ 0.5
    + RandomTrend(min_anchor=-0.5, max_anchor=0.5) @ 0.5
)

The docs for the augmentation library proceed to use the augmentation in the manner below:

X_aug, Y_aug = my_aug.run(X, Y)

Upon further investigation on this site, it seems that the augmentation operates on numpy arrays. The docs state that it supports multivariate augmentation, but I am not sure how that actually works in practice.

I would like to apply the same augmentation consistently across the float columns, such as cas_pre and fl_rat, so that the result does not diverge too much from the original data or from the relationships between the columns. I do not want to apply it to columns such as timestamp. I am not sure how to do this within Pandas.

SDG
  • Are you able to share an example dataset (all of the columns, but just a few rows)? I'm not particularly up to speed on timeseries augmentation, but I'd assume that it creates new fake samples - so there will need to be new timestamps associated with these? – kabdulla Oct 26 '20 at 17:12

1 Answer

This is my attempt:

# Convert the pandas DataFrame to a NumPy array and apply a tsaug transformation

import pandas as pd
from tsaug import Drift

df = pd.DataFrame({
    "timestamp": [1, 2],
    "cas_pre": [687.982849, 693.427673],
    "fl_rat": [1627.040283, 1506.217285],
})

# Drift is used here purely as an example augmenter
my_aug = Drift(max_drift=(0.1, 0.5))

# Extract the columns once so the printed input is exactly what gets augmented
X = df[["timestamp", "cas_pre", "fl_rat"]].to_numpy()
aug = my_aug.augment(X)

print("Input:")
print(X)  # debug
print("Output:")
print(aug)

Console Output:

Input:
[[1.00000000e+00 6.87982849e+02 1.62704028e+03]
 [2.00000000e+00 6.93427673e+02 1.50621728e+03]]
Output:
[[1.00000000e+00 9.13389853e+02 2.03588979e+03]
 [2.00000000e+00 1.01536282e+03 1.43177109e+03]]

You may need to convert your timestamps to something numeric.
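
As a rough sketch of that conversion (my own illustration, not something from the tsaug docs), the timestamps can be turned into seconds since the Unix epoch and converted back afterwards. The column name `t_seconds` is made up for this example, and the sample rows are copied from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["2017-04-06 11:25:00", "2017-04-06 11:30:00", "2017-04-06 11:35:00"],
    "cas_pre": [687.982849, 693.427673, 692.686310],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Express each timestamp as float seconds since the Unix epoch
epoch = pd.Timestamp("1970-01-01")
df["t_seconds"] = (df["timestamp"] - epoch) / pd.Timedelta(seconds=1)

# Convert the numeric values back into timestamps when needed
restored = epoch + pd.to_timedelta(df["t_seconds"], unit="s")

print(df[["timestamp", "t_seconds"]])
```

Dividing by a `Timedelta` rather than casting to an integer keeps the result independent of the datetime resolution the column happens to use.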

The tsaug functions you use don't seem to exist, so I only applied Drift as an example. After some experimentation, TimeWarp() doesn't affect timestamps (column 1) by default, but TimeWarp() * 5 inserts new samples by cloning each timestamp 5 times.
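
One caveat, assuming I am reading the tsaug docs correctly: input arrays are expected to have shape (T,), (N, T) or (N, T, C), so the 2-D (2, 3) array above may be interpreted as two series of three time steps rather than two time steps with three columns. It is worth verifying this against your installed version. A more cautious approach is to leave the timestamp out entirely and pass only the float columns as a single multichannel series of shape (1, T, C). Below is a minimal sketch of that idea. `augment_float_columns` and `jitter` are hypothetical helpers written for this example, and `jitter` is a plain NumPy stand-in rather than a tsaug augmenter:

```python
import numpy as np
import pandas as pd


def augment_float_columns(df, columns, augment_fn):
    """Augment `columns` together as one multichannel series.

    `augment_fn` is a hypothetical stand-in for an augmenter: any callable
    that takes an array of shape (N, T, C) and returns the same shape.
    """
    values = df[columns].to_numpy(dtype=float)   # shape (T, C)
    batch = values[np.newaxis, :, :]             # shape (1, T, C)
    augmented = np.asarray(augment_fn(batch))
    if augmented.shape != batch.shape:
        raise ValueError(
            "augmenter changed the array shape; timestamps would need rebuilding"
        )
    out = df.copy()
    out[columns] = augmented[0]                  # the timestamp column is untouched
    return out


df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2017-04-06 11:25:00", "2017-04-06 11:30:00", "2017-04-06 11:35:00"]
    ),
    "cas_pre": [687.982849, 693.427673, 692.686310],
    "fl_rat": [1627.040283, 1506.217285, 1537.114807],
})

rng = np.random.default_rng(0)


def jitter(x):
    # Stand-in augmenter (not tsaug): adds small Gaussian noise to every value
    return x + rng.normal(scale=0.1, size=x.shape)


result = augment_float_columns(df, ["cas_pre", "fl_rat"], jitter)
print(result)
```

With tsaug installed, passing something like `Drift(max_drift=(0.1, 0.5)).augment` as `augment_fn` should slot in the same way for augmenters that keep the series length unchanged, though I have not checked this against every tsaug version. Augmenters that produce several copies or change the length would need new timestamps to be generated as well.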

Ruben Helsloot
  • so is it actually augmenting the data by the column? – SDG Oct 27 '20 at 06:10
  • Yes, in my example output column 1 is timestamps, column 2 is cas_pre and column 3 is fl_rat. Only columns 2 and 3 are modified. Unrelated, but after some experimentation TimeWarp() works slightly differently than I initially thought. Will update answer. – Zack Henkusens Oct 27 '20 at 07:20
  • Yeah, I ended up using a couple of bits from your answer and got some success. Please update your answer in the meantime. – SDG Oct 27 '20 at 08:24
  • How does it work when the timestamps are of type 'Timestamp' rather than 'float'? – parvaneh shayegh Jul 08 '21 at 08:42