
I have time series data for physical activities. The data was recorded at 50 Hz, but now I want to downsample it to 20 Hz because I want to train my model and make predictions at 20 Hz.

Is there an efficient way to do that in Python? I've heard of pandas' resample function but don't know exactly how to use it for my problem. Any piece of code would be really helpful.

   epoch (ms)              time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
1613977400899   2021-02-22T12:03:20.899            0      -0.336       0.886       0.649
1613977400920   2021-02-22T12:03:20.920        0.021      -0.233       0.799       0.648
1613977400940   2021-02-22T12:03:20.940        0.041      -0.173       0.771       0.629
1613977400961   2021-02-22T12:03:20.961        0.062      -0.132       0.757       0.596
1613977400981   2021-02-22T12:03:20.981        0.082      -0.113       0.724       0.57
1613977401002   2021-02-22T12:03:21.002        0.103      -0.127       0.713       0.538
1613977401021   2021-02-22T12:03:21.021        0.122      -0.175       0.743       0.488
1613977401041   2021-02-22T12:03:21.041        0.142      -0.266       0.775       0.417
1613977401062   2021-02-22T12:03:21.062        0.163      -0.281       0.774       0.402
1613977401082   2021-02-22T12:03:21.082        0.183      -0.212       0.713       0.427
1613977401103   2021-02-22T12:03:21.103        0.204      -0.17        0.649       0.46
1613977401123   2021-02-22T12:03:21.123        0.224      -0.204       0.649       0.524
1613977401144   2021-02-22T12:03:21.144        0.245      -0.313       0.684       0.658
1613977401164   2021-02-22T12:03:21.164        0.265      -0.415       0.727       0.785
1613977401183   2021-02-22T12:03:21.183        0.284      -0.419       0.726       0.82
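For reference, this is roughly how I load the data into a DataFrame (the file name below is just a placeholder for my actual CSV):

>>> import pandas as pd
>>> df = pd.read_csv('activity_50hz.csv')   # placeholder name for the uploaded CSV file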

1 Answer


The main issue here seems to be that your original sampling interval is only roughly 20 ms (i.e. 50 Hz), not exactly; you can verify that with the quick check shown after the steps below. We'll need to resample in 2 steps:

  1. Upsample to 1 ms, where we can choose which interpolation to use
  2. Downsample to 50 ms (i.e. 20 Hz), which after step 1 is just picking one row out of every 50
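You can see the jitter in your sampling interval by looking at the gaps between consecutive samples, for example with the epoch column:

>>> df['epoch (ms)'].diff().value_counts()   # gaps of 19, 20 and 21 ms, not a constant 20 ms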

First let's build a time index. You have the time information twice (epoch and timestamp string), so either of these will work:

>>> df = df.set_index(df['epoch (ms)'].astype('datetime64[ms]'))
>>> df = df.set_index(pd.to_datetime(df['time (10:00)']))
>>> df
                            epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                    
2021-02-22 12:03:20.899  1613977400899  2021-02-22T12:03:20.899        0.000      -0.336       0.886       0.649
2021-02-22 12:03:20.920  1613977400920  2021-02-22T12:03:20.920        0.021      -0.233       0.799       0.648
2021-02-22 12:03:20.940  1613977400940  2021-02-22T12:03:20.940        0.041      -0.173       0.771       0.629
2021-02-22 12:03:20.961  1613977400961  2021-02-22T12:03:20.961        0.062      -0.132       0.757       0.596
2021-02-22 12:03:20.981  1613977400981  2021-02-22T12:03:20.981        0.082      -0.113       0.724       0.570
2021-02-22 12:03:21.002  1613977401002  2021-02-22T12:03:21.002        0.103      -0.127       0.713       0.538
2021-02-22 12:03:21.021  1613977401021  2021-02-22T12:03:21.021        0.122      -0.175       0.743       0.488
2021-02-22 12:03:21.041  1613977401041  2021-02-22T12:03:21.041        0.142      -0.266       0.775       0.417
2021-02-22 12:03:21.062  1613977401062  2021-02-22T12:03:21.062        0.163      -0.281       0.774       0.402
2021-02-22 12:03:21.082  1613977401082  2021-02-22T12:03:21.082        0.183      -0.212       0.713       0.427
2021-02-22 12:03:21.103  1613977401103  2021-02-22T12:03:21.103        0.204      -0.170       0.649       0.460
2021-02-22 12:03:21.123  1613977401123  2021-02-22T12:03:21.123        0.224      -0.204       0.649       0.524
2021-02-22 12:03:21.144  1613977401144  2021-02-22T12:03:21.144        0.245      -0.313       0.684       0.658
2021-02-22 12:03:21.164  1613977401164  2021-02-22T12:03:21.164        0.265      -0.415       0.727       0.785
2021-02-22 12:03:21.183  1613977401183  2021-02-22T12:03:21.183        0.284      -0.419       0.726       0.820

(Now we don’t really need the epoch and time columns any more, as the info is in the index)
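If you prefer, you can drop those two columns before resampling so that only the signal columns get interpolated; note the outputs below keep them for reference:

>>> df = df.drop(columns=['epoch (ms)', 'time (10:00)'])   # optional clean-up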

Now we can do the resampling:

>>> df.resample('1ms').interpolate().resample('50ms').last()
                           epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                   
2021-02-22 12:03:20.850  1.613977e+12  2021-02-22T12:03:20.899        0.000   -0.336000    0.886000    0.649000
2021-02-22 12:03:20.900  1.613977e+12  2021-02-22T12:03:20.940        0.050   -0.155429    0.765000    0.614857
2021-02-22 12:03:20.950  1.613977e+12  2021-02-22T12:03:20.981        0.100   -0.125000    0.714571    0.542571
2021-02-22 12:03:21.000  1.613977e+12  2021-02-22T12:03:21.041        0.150   -0.271714    0.774619    0.411286
2021-02-22 12:03:21.050  1.613977e+12  2021-02-22T12:03:21.082        0.200   -0.178000    0.661190    0.453714
2021-02-22 12:03:21.100  1.613977e+12  2021-02-22T12:03:21.144        0.250   -0.338500    0.694750    0.689750
2021-02-22 12:03:21.150  1.613977e+12  2021-02-22T12:03:21.183        0.284   -0.419000    0.726000    0.820000

Note that you can use different types of interpolation by specifying the method argument you pass to .interpolate(). See the documentation:

method : str, default ‘linear’
Interpolation technique to use. One of:

  • ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
  • ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
  • ‘index’, ‘values’: use the actual numerical values of the index.
  • ‘pad’: Fill in NaNs using existing values.
  • ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
  • ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
  • ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
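For example, 'spline' and 'polynomial' need an explicit order (and SciPy installed); a cubic spline would look like:

>>> df.resample('1ms').interpolate('spline', order=3).resample('50ms').last()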

You can see slight differences in the interpolated values; it's up to you to pick the right method for your data:

>>> df.resample('1ms').interpolate('time').resample('50ms').last()
                           epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                   
2021-02-22 12:03:20.850  1.613977e+12  2021-02-22T12:03:20.899        0.000   -0.336000    0.886000    0.649000
2021-02-22 12:03:20.900  1.613977e+12  2021-02-22T12:03:20.940        0.050   -0.155429    0.765000    0.614857
2021-02-22 12:03:20.950  1.613977e+12  2021-02-22T12:03:20.981        0.100   -0.125000    0.714571    0.542571
2021-02-22 12:03:21.000  1.613977e+12  2021-02-22T12:03:21.041        0.150   -0.271714    0.774619    0.411286
2021-02-22 12:03:21.050  1.613977e+12  2021-02-22T12:03:21.082        0.200   -0.178000    0.661190    0.453714
2021-02-22 12:03:21.100  1.613977e+12  2021-02-22T12:03:21.144        0.250   -0.338500    0.694750    0.689750
2021-02-22 12:03:21.150  1.613977e+12  2021-02-22T12:03:21.183        0.284   -0.419000    0.726000    0.820000
>>> df.resample('1ms').interpolate('cubic').resample('50ms').last()
                           epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                   
2021-02-22 12:03:20.850  1.613977e+12  2021-02-22T12:03:20.899        0.000   -0.336000    0.886000    0.649000
2021-02-22 12:03:20.900  1.613977e+12  2021-02-22T12:03:20.940        0.050   -0.153162    0.766266    0.615219
2021-02-22 12:03:20.950  1.613977e+12  2021-02-22T12:03:20.981        0.100   -0.122950    0.711454    0.543581
2021-02-22 12:03:21.000  1.613977e+12  2021-02-22T12:03:21.041        0.150   -0.285487    0.781273    0.403123
2021-02-22 12:03:21.050  1.613977e+12  2021-02-22T12:03:21.082        0.200   -0.172478    0.656944    0.452494
2021-02-22 12:03:21.100  1.613977e+12  2021-02-22T12:03:21.144        0.250   -0.342439    0.695493    0.693425
2021-02-22 12:03:21.150  1.613977e+12  2021-02-22T12:03:21.183        0.284   -0.419000    0.726000    0.820000
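As a sanity check, you can confirm that the result really is spaced at 50 ms, i.e. 20 Hz, with something like:

>>> out = df.resample('1ms').interpolate('time').resample('50ms').last()
>>> out.index.to_series().diff().dropna().unique()   # should show a single 50 ms step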