I have a pandas Series at business-day frequency, and I want to resample it to weekly frequency by taking the product of the (up to) five values in each week.

Some dummy data:

import numpy as np
import pandas as pd
dates = pd.bdate_range('2000-01-01', '2022-12-31')
s = pd.Series(np.random.uniform(size=len(dates)), index=dates)

# randomly assign NaN's
mask = np.random.randint(0, len(dates), round(len(dates)*.9))
s.iloc[mask] = np.nan

Note that the majority of the values in this Series are NaN.

The simple .prod method called after .resample is fast:

%timeit s.resample('W-FRI').prod()
10.2 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But I need to be precise when taking the product: I want to pass min_count=1 to prod, so that a week with no valid values gives NaN rather than 1. That is when it becomes very slow:

%timeit s.resample('W-FRI').apply(lambda x: x.prod(min_count=1))
69.1 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I think the problem is not specific to prod: it applies generally to built-in pandas aggregations versus custom functions passed to .apply.

How can I achieve performance similar to .resample().prod() while passing min_count=1?

data-monkey
  • `apply` is known to be slow, since it has to iterate at python level. The exceptions are (based on some other SO), when the function is a 'builtin pandas', or you can use the 'raw' mode, which does the work in `numpy`. – hpaulj Dec 14 '22 at 17:50
  • [`.resample().prod(min_count=1)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.prod.html). See [implementation](https://i.stack.imgur.com/6VCg2.png) – Trenton McKinney Dec 14 '22 at 18:02
  • @TrentonMcKinney I feel embarrassed that I missed that. – data-monkey Dec 14 '22 at 18:22

1 Answer

Until I saw Trenton McKinney's comment, I was going to propose:

def f(rs, min_count=0):
    res = rs.prod()
    # blank out bins that have fewer than min_count non-NaN values
    res[rs.count() < min_count] = np.nan
    return res

%timeit f(s.resample('W-FRI'), min_count=1)
# same timing as s.resample('W-FRI').prod()

But Trenton's suggestion is far better:

s.resample('W-FRI').prod(min_count=1)
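
To make the difference concrete, here is a minimal sketch on a small hand-made series (the dates and values are made up for illustration, not taken from the question's data). It assumes two business weeks, the second of which is entirely NaN, and compares the default product, the `min_count=1` product, and the slow `.apply` version:

```python
import numpy as np
import pandas as pd

# Illustrative data: Mon 2022-01-03 .. Fri 2022-01-14 (two business weeks);
# the second week contains only NaN.
dates = pd.bdate_range('2022-01-03', '2022-01-14')
s = pd.Series([2.0, np.nan, 3.0, np.nan, np.nan] + [np.nan] * 5, index=dates)

rs = s.resample('W-FRI')

default = rs.prod()                              # NaNs skipped; empty product is 1.0
strict = rs.prod(min_count=1)                    # needs at least one valid value per week
slow = rs.apply(lambda x: x.prod(min_count=1))   # same result via Python-level apply

print(pd.DataFrame({'default': default, 'strict': strict, 'slow': slow}))
```

The all-NaN week should come out as 1.0 in the default column and as NaN in the other two, which is the behaviour the question asks for.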

I'm only mentioning this for other cases where one might be tempted to use .apply(), but where calling a couple of built-in aggregations on the resampler object is faster.
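
As a sketch of that pattern in another setting, consider a weekly range (max minus min). This is a hypothetical example, not something from the question: instead of passing a lambda to `.apply()`, the resampler object is reused for two built-in aggregations and the results are combined.

```python
import numpy as np
import pandas as pd

# Illustrative data: ten business days with values 0.0 .. 9.0
dates = pd.bdate_range('2022-01-03', '2022-01-14')
s = pd.Series(np.arange(10, dtype=float), index=dates)

rs = s.resample('W-FRI')

# Two built-in aggregations on the same resampler, combined afterwards
fast = rs.max() - rs.min()

# Equivalent result, but the lambda is called once per week at Python level
slow = rs.apply(lambda x: x.max() - x.min())

print(fast)
```

Both give the same weekly values; on a long series the first form would be expected to be noticeably faster, though the exact gap depends on the pandas version and the data.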

Pierre D