
I have a CSV file that stores data from different smartphone sensors. The timestamps are the nanoseconds elapsed since the recording program was started. Short example:

timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels
25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER
28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER
31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER
33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER
36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER
38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER
41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER
43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER
46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER

The time steps between the timestamps are not equal, but I would like them to be. My question is how to achieve this. I was thinking about simply downsampling the nanoseconds to microseconds with the code below. This is my first attempt; it does not raise an error during execution, but it returns a CSV file without the timestamps, and every row after the first is completely empty.

import pandas

series = pandas.read_csv("file3.csv", header=0, index_col=0, squeeze=True, nrows=1000)
series.index = pandas.to_datetime(series.index, unit='ns')
downsampled = series.resample("U").mean()
downsampled.to_csv("file4.csv", index=False)

I would be thankful for ways to improve my code, as well as for other ideas on how to achieve my goal in general.

– lapplapp

2 Answers


I think this line is incorrect:

series.index = pandas.to_datetime(series.index, unit='ns')

It should use the timestamps column instead of the index (this assumes the file is read without index_col, so that timestamps is a regular column):

series.index = pandas.to_datetime(series.timestamps, unit='ns')

Here's the result:

                             timestamps  acce_x  acce_y  acce_z  grav_x  grav_y  grav_z
timestamps
1970-01-01 00:00:00.025993 25993266.000  -2.529   6.918   4.340  -2.901   7.935   4.978
1970-01-01 00:00:00.025994          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.025995          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.025996          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.025997          NaN     NaN     NaN     NaN     NaN     NaN     NaN
...                                 ...     ...     ...     ...     ...     ...     ...
1970-01-01 00:00:00.046130          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046131          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046132          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046133          NaN     NaN     NaN     NaN     NaN     NaN     NaN
1970-01-01 00:00:00.046134 46134867.000  -1.990   6.810   6.321  -2.419   7.394   5.971
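
For completeness, here is a minimal end-to-end sketch of that fix, assuming the same file3.csv as in the question. I dropped squeeze=True (it does nothing on a multi-column frame and was removed in pandas 2.0) and used numeric_only=True so that the non-numeric labels column is skipped rather than raising:

import pandas

# Read without index_col so 'timestamps' stays a regular column
series = pandas.read_csv("file3.csv", header=0, nrows=1000)
series.index = pandas.to_datetime(series.timestamps, unit='ns')

# numeric_only=True drops the non-numeric 'labels' column
downsampled = series.resample("U").mean(numeric_only=True)

# index=True keeps the resampled timestamps in the output file
downsampled.to_csv("file4.csv", index=True)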
– Jonathan Leon

When you resample at microsecond intervals, there aren't enough values to fill consecutive buckets, so you end up with NaNs.
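
To see this concretely, here is a small sketch, assuming the nine sample rows from the question are already loaded into a DataFrame df with the raw nanosecond timestamps as the index. The samples span about 20 ms, so microsecond bucketing creates roughly 20,000 bins, of which only nine contain data:

import pandas as pd

# Convert the nanosecond integers to a DatetimeIndex, as in the question
df.index = pd.to_datetime(df.index, unit='ns')

fine = df.resample('us').mean(numeric_only=True)

print(len(fine))                        # ~20142 one-microsecond buckets
print(fine.notna().any(axis=1).sum())   # only 9 of them hold a sample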

If you want your timesteps to be equal while also having all buckets filled, you can find the maximum difference and use that as the resampling rate:

First, set the index to Timedeltas, since the values represent the time elapsed since the app started.

import pandas as pd

# df is the DataFrame read from the CSV, with 'timestamps' as the index
df.index = df.index.map(lambda t: pd.Timedelta(t, unit='ns'))
df.index

# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028129496',
                '0 days 00:00:00.031028666', '0 days 00:00:00.033164897',
                '0 days 00:00:00.036064067', '0 days 00:00:00.038200297',
                '0 days 00:00:00.041099467', '0 days 00:00:00.043235697',
                '0 days 00:00:00.046134867'],
               dtype='timedelta64[ns]', name='timestamps', freq=None)

Next, resampling:

import numpy as np

max_diff = np.diff(df.index).max()
# numpy.timedelta64(2899170,'ns')

# convert to pandas.Timedelta to use it with `resample`
dfr = df.resample(pd.Timedelta(max_diff)).mean()
dfr

Output:

                             acce_x    acce_y    acce_z    grav_x    grav_y    grav_z
timestamps                                                                           
0 days 00:00:00.025993266 -2.529037  6.918060  4.340012 -2.888277  7.903406  5.034998
0 days 00:00:00.028892436 -2.537415  6.931229  4.605766 -2.850200  7.807237  5.205172
0 days 00:00:00.031791606 -2.545792  6.944397  4.871521 -2.796879  7.735252  5.337435
0 days 00:00:00.034690776 -2.472771  6.912071  5.180374 -2.743558  7.663267  5.469699
0 days 00:00:00.037589946 -2.399750  6.879746  5.489227 -2.664888  7.592961  5.602406
0 days 00:00:00.040489116 -2.187862  6.843834  5.941738 -2.544471  7.490385  5.794085
0 days 00:00:00.043388286 -1.990341  6.810318  6.321220 -2.419225  7.393572  5.971000

And to verify that the index is evenly spaced, note that it now has freq='2899170N':

dfr.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028892436',
                '0 days 00:00:00.031791606', '0 days 00:00:00.034690776',
                '0 days 00:00:00.037589946', '0 days 00:00:00.040489116',
                '0 days 00:00:00.043388286'],
               dtype='timedelta64[ns]', name='timestamps', freq='2899170N')

Or check via diff:

np.diff(dfr.index)
# output:
array([2899170, 2899170, 2899170, 2899170, 2899170, 2899170],
      dtype='timedelta64[ns]')
– aneroid
  • Btw, what you could do instead is: resample in millisecond intervals with forward fill (`ffill`) and then resample again at a larger interval with a mean (see the sketch after these comments). You would end up with better results and not "lose" records. – aneroid Jan 10 '21 at 01:03
  • As simple as `re_milli = df.resample("U").fillna("ffill")` and then `re_mean = re_milli.resample("10U").mean()`? Because after doing so, my data looks odd... many identical values with a single different one in between. – lapplapp Jan 10 '21 at 17:30
  • Yeah, with so few records (at least from the question sample) you either lose some to get sensible `mean()` values (per my answer) or you get multiple (100x) records with many repetitions (per my comment). Otherwise, you'd just get NA's and then need to pick one of these two options. Based on your sample records, you could probably resample every '0.003s' or '3ms'. – aneroid Jan 10 '21 at 18:29
  • 1
    Btw, `'U'` is _micro_ seconds which is too fine grained. _Milli_ seconds should be fine for you (`'L'` / 'ms'). I calculated the max difference and used that as the resampling rate (rather than just gues). 2899170 nanoseconds = 2.89 milliseconds ≈ 3ms. – aneroid Jan 10 '21 at 18:35
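
Putting the comment thread together, a minimal sketch of the two-step approach, assuming df is the frame with the TimedeltaIndex set up in the answer above and using the 1 ms and 3 ms rates suggested in the comments:

import pandas as pd

# Step 1: upsample to a regular 1 ms grid with a forward fill.
# This is what produces the runs of identical values mentioned above.
re_milli = df.resample('ms').ffill()

# Step 2: downsample to 3 ms buckets with a mean
re_mean = re_milli.resample('3ms').mean(numeric_only=True)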