1

I need to calculate the differences between consecutive time groups in data like the following

from io import StringIO

import pandas as pd

strio = StringIO("""\
               date  feat1         feat2  value
2016-10-15T00:00:00      1             1    0.0
2016-10-15T00:00:00      1             2    1.0
2016-10-15T00:00:00      2             1    2.0
2016-10-15T00:00:00      2             2    3.0
2016-10-15T00:01:00      1             1    8.0
2016-10-15T00:01:00      1             2    5.0
2016-10-15T00:02:00      1             1    8.0
2016-10-15T00:02:00      1             2   12.0
2016-10-15T00:02:00      2             1   10.0
2016-10-15T00:02:00      2             2   11.0
2016-10-15T00:03:00      1             1   12.0
2016-10-15T00:03:00      1             2   13.0
2016-10-15T00:03:00      2             1   14.0
2016-10-15T00:03:00      2             2   15.0""")

I can do this using xarray library

df = pd.read_table(strio, sep='\s+')
dims = df.columns.values[:3].tolist()
df.set_index(dims, inplace=True) # needed to convert to xarray dataset
dataset = df.to_xarray()
diff_time = dataset.diff(dim=dims[0]) # take the diff in time
print(diff_time.to_dataframe().reset_index())

prints

                   date  feat1  feat2  value
0   2016-10-15T00:01:00      1      1    8.0
1   2016-10-15T00:01:00      1      2    4.0
2   2016-10-15T00:01:00      2      1    NaN
3   2016-10-15T00:01:00      2      2    NaN
4   2016-10-15T00:02:00      1      1    0.0
5   2016-10-15T00:02:00      1      2    7.0
6   2016-10-15T00:02:00      2      1    NaN
7   2016-10-15T00:02:00      2      2    NaN
8   2016-10-15T00:03:00      1      1    4.0
9   2016-10-15T00:03:00      1      2    1.0
10  2016-10-15T00:03:00      2      1    4.0
11  2016-10-15T00:03:00      2      2    4.0

So in time instant 2016-10-15T00:01:00 that I have feat1:2 missing the relevant diffs are nan

How can I do this in pure pandas in a vectorized way? Constructing the original dataframe with nan fill-ins (so groups are equally sized) is an option but rather avoided

A clumsy way to do it would be:

dfs = []
for k, v in zip(itertools.islice(df.groupby(level=0).groups.values(), 1, None),
                df.groupby(level=0).groups.values()):
    # print(df.loc(axis=0)[k.values] , df.loc(axis=0)[v.values])
    diff = df.loc(axis=0)[k.values].reset_index(level=0, drop=True) - \
           df.loc(axis=0)[v.values].reset_index(level=0, drop=True)
    diff = pd.concat([diff], keys=[k.values[0][0]], names=['date'])
    dfs.append(diff)
print(pd.concat(dfs).reset_index())

It does print the same output but it is not vectorized

Mr_and_Mrs_D
  • 32,208
  • 39
  • 178
  • 361
  • Could you explain the logic behind the output? How are `feat1` and `feat2` relevant? Is every `value` 4 because you're trying to get the size of each `date` group? If so, is this close to what you're looking for: `df['value'] = df['date'].map(df.groupby('date').size())`? – Peter Leimbigler Jan 28 '19 at 15:09
  • Nope 4 is the difference between the same (feat1, feat2) subdataframes (each group that is). So this row `2016-10-15T00:01:00 1 1 4.0` minus that one (previous instance same feat1 and feat2) `2016-10-15T00:01:00 1 1 4.0` - when no feat1 or feat2 are present in next or previous instance I naturally get NaN – Mr_and_Mrs_D Jan 28 '19 at 15:19
  • Note to self: drop index: https://stackoverflow.com/a/17085044/281545 and add it back: https://stackoverflow.com/a/42094658/281545, to explain the loop code in the question – Mr_and_Mrs_D Jan 31 '19 at 16:59

1 Answers1

2

Updated solution:

df.unstack(0)['value']\
  .diff(axis=1)\
  .dropna(how='all', axis=1)\
  .unstack([0,1])\
  .rename('value')\
  .reset_index()

Output:

                   date  feat1  feat2  value
0   2016-10-15T00:01:00      1      1    8.0
1   2016-10-15T00:01:00      1      2    4.0
2   2016-10-15T00:01:00      2      1    NaN
3   2016-10-15T00:01:00      2      2    NaN
4   2016-10-15T00:02:00      1      1    0.0
5   2016-10-15T00:02:00      1      2    7.0
6   2016-10-15T00:02:00      2      1    NaN
7   2016-10-15T00:02:00      2      2    NaN
8   2016-10-15T00:03:00      1      1    4.0
9   2016-10-15T00:03:00      1      2    1.0
10  2016-10-15T00:03:00      2      1    4.0
11  2016-10-15T00:03:00      2      2    4.0

Details:

After creating a three level MultiIndex, first let's unstack level 0, date, which moves dates from rows to columns, then use diff on columns, lastly drop the the first date using dropna where the whole column is nan and unstack feat1 and feat2 to recreate multiindex and convert back to dataframe.

Community
  • 1
  • 1
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    Wow impressive - I am accepting while trying to digest (especially this `transform(lambda x: x.iloc[0]).unstack(0).diff(axis=1)` part, unstack escapes me) – Mr_and_Mrs_D Jan 28 '19 at 15:53
  • Let me add the steps into the solution. – Scott Boston Jan 28 '19 at 15:55
  • Basically, I am reshaping the dataframe that will let me use `diff` on columns easily and reshaping back to original form. – Scott Boston Jan 28 '19 at 16:02
  • Hmmm I was a bit in a hurry I modified the example a bit, it does not produce the correct results for more irregular dataframe - the `d1 = df.groupby(level=[0])['value'].transform(lambda x: x.iloc[0])` would need all values to appear, so I have to unaccept momentarily :P – Mr_and_Mrs_D Jan 28 '19 at 16:03
  • No problem. Yeah, if it gets more irregular it will become more complex. Such as missing dates, you will have to resample dates and etc... first. – Scott Boston Jan 28 '19 at 16:05
  • See modified df - no I mean more irregular values - missing dates can also be present – Mr_and_Mrs_D Jan 28 '19 at 16:06
  • Yeah... I think instead of groupby(level=[0]) only, let's try to groupby(level=[0,1,2]). Which, I think simplifies the problem a bit. We might not need groupby at all. – Scott Boston Jan 28 '19 at 16:10
  • Hopefully :) The loop solution I posted still prints the correct result if it helps any – Mr_and_Mrs_D Jan 28 '19 at 16:11
  • 1
    The only requirement is that dates are indeed ordered - but there may be gaps - loop solution still would work although clumsy as hell – Mr_and_Mrs_D Jan 28 '19 at 16:14
  • 1
    Yey - thanks - feel free to remove initial solution as it did not really solve it, could use some explanation for `unstack(0)` which is the gist of it - so this moves the 'date' level to the columns and then we diff the columns IIUC – Mr_and_Mrs_D Jan 28 '19 at 16:25