2

I'm attempting to use pandas to manipulate some data and not seeming to find a built-in way to do resample my data to merge datasets with differing time indices.

It's not hard to do what I want using loops and such, but I'm trying to see if there's a means built into pandas to make it cleaner, perhaps using reindex or resample.

I have two data sets, each of which have data at somewhat regular intervals. One has data every ~10 seconds, the other every ~0.5 seconds. Note these are not perfectly regular in my data (e.g. could be every 9-11 seconds) and this must be preserved.

First I'd like to downsample the high resolution data such that I get the mean of all the high resolution samples that fall in the intervals between the low resolution samples.

Second I'd like to be able to upsample the low resolution data onto the high resolution time indices with interpolation.

I see lots of resample examples, but typically using regular time intervals (e.g. "10S) which won't work for me. I need to resample to irregular intervals defined by another series' index.

An example of the code, and a "manual" way to get the desired downsampling is below.

import pandas as pd
import numpy as np

#  high resolution data
index1 = pd.date_range('1/1/2000', periods=30, freq='S')
d = np.array([1 for i in range(30)])
d[12:23] = 8
hi_res_data = pd.Series(d, index=index1)

# low resolution data
index2 = pd.date_range('1/1/2000 00:00:00.5', periods=6, freq='5S')
low_res_data = pd.Series(range(100,106), index=index2)

#desired data is the average of the high resolution data samples
#falling within the intervals defined by low resolution data
# unrolled loop just to make it as clear as possible
desired_data = [
    0, #for simplicity sake, not concerned with edge effects
    hi_res_data[low_res_data.index[0]:low_res_data.index[1]].mean(),
    hi_res_data[low_res_data.index[1]:low_res_data.index[2]].mean(),
    hi_res_data[low_res_data.index[2]:low_res_data.index[3]].mean(),
    hi_res_data[low_res_data.index[3]:low_res_data.index[4]].mean(),
    hi_res_data[low_res_data.index[4]:low_res_data.index[5]].mean() 
]

1 Answers1

0

Annotated Code

# Use searchsorted to find the insertion points in low_res_data
# for timestamps from high_res_data which ensure preservation of the
# sorting order in low_res_data
i = np.searchsorted(low_res_data.index, hi_res_data.index)

# Identify the indices where timestamps fall out of the valid range
m = np.isin(i, [0, len(low_res_data)])

# Groupby and agg the high_res_data based on timestamps
# aligned with those in low_res_data
avg = hi_res_data[~m].groupby(low_res_data.index[i[~m]]).mean()

# Reindex to fill the missing timestamps
avg = avg.reindex(low_res_data.index, fill_value=0)

print(avg)

2000-01-01 00:00:00.500    0.0
2000-01-01 00:00:05.500    1.0
2000-01-01 00:00:10.500    1.0
2000-01-01 00:00:15.500    6.6
2000-01-01 00:00:20.500    8.0
2000-01-01 00:00:25.500    3.8
Freq: 5S, dtype: float64
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53