
I've got a pandas DataFrame like this, containing the columns timestamp, id, foo, and bar. The timestamps are roughly ten minutes apart.

timestamp            id  foo  bar
2019-04-14 00:00:10  1   0.10 0.05
2019-04-14 00:10:02  1   0.30 0.10
2019-04-14 00:00:00  2   0.10 0.05
2019-04-14 00:10:00  2   0.30 0.10
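For reference, the sample frame can be constructed like this (a minimal sketch of the data shown above):

import pandas as pd

df = pd.DataFrame({
    'timestamp': ['2019-04-14 00:00:10', '2019-04-14 00:10:02',
                  '2019-04-14 00:00:00', '2019-04-14 00:10:00'],
    'id': [1, 1, 2, 2],
    'foo': [0.10, 0.30, 0.10, 0.30],
    'bar': [0.05, 0.10, 0.05, 0.10],
})
df['timestamp'] = pd.to_datetime(df['timestamp'])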

For each id, I'd like to create additional rows (four per id in the example below) with timestamps split equally between successive rows, and foo and bar values containing random values that lie between those of the successive rows.

The start time should be the earliest timestamp for each id, and the end time should be the latest timestamp for each id.

So the output would be like this.

timestamp            id  foo  bar
2019-04-14 00:00:10  1   0.10 0.05
2019-04-14 00:02:10  1   0.14 0.06
2019-04-14 00:04:10  1   0.11 0.06
2019-04-14 00:06:10  1   0.29 0.07
2019-04-14 00:08:10  1   0.22 0.09
2019-04-14 00:10:02  1   0.30 0.10
2019-04-14 00:00:00  2   0.80 0.50
2019-04-14 00:02:00  2   0.45 0.48
2019-04-14 00:04:00  2   0.52 0.42
2019-04-14 00:06:00  2   0.74 0.48
2019-04-14 00:08:00  2   0.41 0.45
2019-04-14 00:10:00  2   0.40 0.40

I can reindex the timestamp column and create additional timestamp rows (e.g., as in Pandas create new date rows and forward fill column values).

But I can't seem to wrap my head around how to compute the random values for foo and bar between the successive rows.

I'd appreciate it if someone could point me in the right direction!

Kvothe

2 Answers


You're close. What you need is date_range with DataFrame.reindex, using the first and last values of the DatetimeIndex:

import pandas as pd
import numpy as np

df['timestamp'] = pd.to_datetime(df['timestamp'])

# for each id, reindex foo/bar onto 6 equally spaced timestamps
# between that id's first and last timestamp
df = (df.set_index('timestamp')
        .groupby('id')[['foo', 'bar']]
        .apply(lambda x: x.reindex(pd.date_range(x.index[0], x.index[-1], periods=6))))

Then create a helper DataFrame of the same shape as the original and fill the missing values with DataFrame.fillna:

# random values in [0, 1) with the same shape, index and columns as df
df1 = pd.DataFrame(np.random.rand(*df.shape), index=df.index, columns=df.columns)
# keep the original values and fill only the NaN rows with random data
df = df.fillna(df1)
print(df)
                                 foo       bar
id                                            
1  2019-04-14 00:00:10.000  0.100000  0.050000
   2019-04-14 00:02:08.400  0.903435  0.755841
   2019-04-14 00:04:06.800  0.956002  0.253878
   2019-04-14 00:06:05.200  0.388454  0.257639
   2019-04-14 00:08:03.600  0.225535  0.195306
   2019-04-14 00:10:02.000  0.300000  0.100000
2  2019-04-14 00:00:00.000  0.100000  0.050000
   2019-04-14 00:02:00.000  0.180865  0.327581
   2019-04-14 00:04:00.000  0.417956  0.414400
   2019-04-14 00:06:00.000  0.012686  0.800948
   2019-04-14 00:08:00.000  0.716216  0.941396
   2019-04-14 00:10:00.000  0.300000  0.100000
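If you then need the flat timestamp / id / foo / bar layout from the question, a small follow-up sketch (not part of the original answer; it assumes the MultiIndex produced above):

# name the two index levels, then move them back into columns
out = df.rename_axis(['id', 'timestamp']).reset_index()
out = out[['timestamp', 'id', 'foo', 'bar']]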
jezrael

If the randomness is not crucial, we can use Series.interpolate, which will keep the values between your min and max per group:

import pandas as pd

# per id, reindex onto 6 equally spaced timestamps between the min and
# max timestamp; the new rows come back as all-NaN
df_new = pd.concat([
    d.reindex(pd.date_range(d.timestamp.min(), d.timestamp.max(), periods=6))
    for _, d in df.groupby('id')
])
df_new['timestamp'] = df_new.index
df_new.reset_index(drop=True, inplace=True)

# merge the original values back in on the matching timestamps
df_new = df_new[['timestamp']].merge(df, on='timestamp', how='left')
df_new['id'] = df_new['id'].ffill()

# linearly interpolate foo and bar between the known endpoint values
df_new[['foo', 'bar']] = df_new[['foo', 'bar']].interpolate()

Which gives the following output:

print(df_new)
                 timestamp   id   foo   bar
0  2019-04-14 00:00:10.000  1.0  0.10  0.05
1  2019-04-14 00:02:08.400  1.0  0.14  0.06
2  2019-04-14 00:04:06.800  1.0  0.18  0.07
3  2019-04-14 00:06:05.200  1.0  0.22  0.08
4  2019-04-14 00:08:03.600  1.0  0.26  0.09
5  2019-04-14 00:10:02.000  1.0  0.30  0.10
6  2019-04-14 00:00:00.000  2.0  0.10  0.05
7  2019-04-14 00:02:00.000  2.0  0.14  0.06
8  2019-04-14 00:04:00.000  2.0  0.18  0.07
9  2019-04-14 00:06:00.000  2.0  0.22  0.08
10 2019-04-14 00:08:00.000  2.0  0.26  0.09
11 2019-04-14 00:10:00.000  2.0  0.30  0.10
Erfan
  • Hey @Erfan, thanks for this! The randomness isn't crucial. However, the min and max values should be based on successive rows, whereas your solution interpolates between the min and max of the whole group. – Kvothe Apr 17 '19 at 00:05
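If the values should be random but bounded by each pair of successive rows, as the comment asks, here is a minimal sketch of one way to do it (not from the original thread; the helper name fill_between and the four-rows-per-segment default are illustrative):

import numpy as np
import pandas as pd

def fill_between(group, n_new=4):
    """Between each pair of successive rows, insert n_new rows with
    equally spaced timestamps and uniform random foo/bar values
    bounded by the two endpoint values of that segment."""
    pieces = []
    for i in range(len(group) - 1):
        lo, hi = group.iloc[i], group.iloc[i + 1]
        ts = pd.date_range(lo['timestamp'], hi['timestamp'], periods=n_new + 2)
        seg = pd.DataFrame({'timestamp': ts, 'id': lo['id']})
        for col in ('foo', 'bar'):
            a, b = sorted((lo[col], hi[col]))
            seg[col] = np.concatenate(
                [[lo[col]], np.random.uniform(a, b, size=n_new), [hi[col]]])
        # drop the duplicated right endpoint except on the final segment
        pieces.append(seg if i == len(group) - 2 else seg.iloc[:-1])
    return pd.concat(pieces, ignore_index=True)

result = (df.sort_values('timestamp')
            .groupby('id', group_keys=False)
            .apply(fill_between)
            .reset_index(drop=True))

The original rows stay as the segment endpoints, and every generated value is drawn with np.random.uniform between the two successive observations rather than the group-wide min and max.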