4

I'm working in python with pandas. I'm querying an external data source for unique users over a group of days (eg unique users per 30 days or per 2 days or per 7 days). The query needs a period start, a period end and an interval ( number of days).

For example to get unique users for jan 1st the params are :

  • start: 2015-01-01
  • end: 2015-01-01
  • interval: 1

For jan 1st to jan 3rd inclusively (3 days grouped):

  • start: 2015-01-01
  • end: 2015-01-03
  • interval: 3

Sometimes I also need multiple periods. in all cases a period can finish beyond the end date, but never start after the end.

For example 30 days uniques for jan 1st to jan 1st:

  • start: 2015-01-01
  • end: 2015-01-01
  • interval: 30

Ideally I would have used pandas.period_range(), but since it doesn't accept multiple in the freq parameter, I turned to pandas.date_range(). I'm iterating over the timestamps in the date range like so:

import pandas
start_date = "2015-01-01"
end_date = "2015-01-03"
interval = 3

for timestamp in pandas.date_range(start_date , end_date , freq=str(interval)+"D"):
    period_start = timesteamp.date()
    period_end = period_start + datetime.timedelta(days=interval)
    # query with period_start, period_end, interval

So when the last element of the range generated by pandas is after the end parameter it produce an unnecessary loop and unwanted query that I want to avoid.

The problem is calls like this:

pandas.date_range("2015-01-01", "2015-01-03", freq="3D")

return:

<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01, 2015-01-04]
Length: 2, Freq: 3D, Timezone: None

With the last element 2015-01-04 being posterior to the end limit 2015-01-03. This doesn't happen when the frequency is shorter or longer than the actual period (in this case 2 days or 4 days):

>>> pandas.date_range("2015-01-01", "2015-01-03", freq="2D")
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01, 2015-01-03]
Length: 2, Freq: 2D, Timezone: None

>>> pandas.date_range("2015-01-01", "2015-01-03", freq="4D")
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 4D, Timezone: None

I tried to understand the close parameter but the description is a bit cryptic for me. None of the three values ("right", "left", None) seem to give what I'm after.

I tested the 3 possible closed= values to observe the different outputs:

>>> pandas.date_range("2015-01-01", "2015-01-03", freq="3D", closed="left")
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 3D, Timezone: None

This seems like the desired result. But in the case where the freq is longer than the time between start and end, it returns a range with 0 elements.

>>> pandas.date_range("2015-01-01", "2015-01-01", freq="10D", closed="left")
<class 'pandas.tseries.index.DatetimeIndex'>
Length: 0, Freq: 10D, Timezone: None

I'd expect :

<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 10D, Timezone: None

The only time I would see the result being empty is if the freq has a multiplier of 0 (ie "0D", "0H", "0W"). This already raises an error, so in practice date_range() should always return at least it's start value. If the end parameter occurs earlier than the end of the frequency, then the range would only contain one timestamp.

miradulo
  • 28,857
  • 6
  • 80
  • 93
Tristan Boudreault
  • 133
  • 1
  • 2
  • 11

2 Answers2

1

I think your interval semantics are a bit confused, and pandas is doing something reasonable here.

Consider your last example:

>>> pandas.date_range("2015-01-01", "2015-01-01", freq="10D", closed="left")

You write that you expect this:

<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 10D, Timezone: None

Consider a date such as 2015-01-04. On the one hand, this would be in the first (only) interval in this DatetimeIndex. On the other hand, this date would fall after the end date you supplied, which seems to contradict the specification of the interval in your original date_range call.

Or in other words, what does it mean to have a 30-day interval from January 1st to January 2nd?

If you want your interval index to automatically expand to an end date that encompasses the start date plus interval length, I think you will need to write a function to do it.

def nonempty_date_range(start, end, freq, closed=None):
    """ Return a pandas.DatetimeIndex containing at least one interval.  In some cases, the interval will extend beyond 'end'. """
    start = pandas.to_datetime(start)
    end = pandas.to_datetime(end)
    length = pandas.to_timedelta(freq)

    end = max(end, start + length)

    return pandas.date_range(start=start, end=end, freq=freq, closed=closed)

Results:

# Example from question
>>> pandas.date_range("2015-01-01", "2015-01-01", freq="10D", closed="left")
DatetimeIndex([], dtype='datetime64[ns]', freq='10D')

Function above:

>>> nonempty_date_range("2015-01-01", "2015-01-01", freq="10D", closed="left")
DatetimeIndex(['2015-01-01'], dtype='datetime64[ns]', freq='10D')
NicholasM
  • 4,557
  • 1
  • 20
  • 47
0

you can set up multiple date_range and use pandas.DatetimeIndex.union to union all datetime