I'm working in python with pandas. I'm querying an external data source for unique users over a group of days (eg unique users per 30 days or per 2 days or per 7 days). The query needs a period start, a period end and an interval ( number of days).
For example to get unique users for jan 1st the params are :
- start: 2015-01-01
- end: 2015-01-01
- interval: 1
For jan 1st to jan 3rd inclusively (3 days grouped):
- start: 2015-01-01
- end: 2015-01-03
- interval: 3
Sometimes I also need multiple periods. in all cases a period can finish beyond the end date, but never start after the end.
For example 30 days uniques for jan 1st to jan 1st:
- start: 2015-01-01
- end: 2015-01-01
- interval: 30
Ideally I would have used pandas.period_range()
, but since it doesn't accept multiple in the freq
parameter, I turned to pandas.date_range()
. I'm iterating over the timestamps in the date range like so:
import pandas
start_date = "2015-01-01"
end_date = "2015-01-03"
interval = 3
for timestamp in pandas.date_range(start_date , end_date , freq=str(interval)+"D"):
period_start = timesteamp.date()
period_end = period_start + datetime.timedelta(days=interval)
# query with period_start, period_end, interval
So when the last element of the range generated by pandas is after the end parameter it produce an unnecessary loop and unwanted query that I want to avoid.
The problem is calls like this:
pandas.date_range("2015-01-01", "2015-01-03", freq="3D")
return:
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01, 2015-01-04]
Length: 2, Freq: 3D, Timezone: None
With the last element 2015-01-04 being posterior to the end limit 2015-01-03. This doesn't happen when the frequency is shorter or longer than the actual period (in this case 2 days or 4 days):
>>> pandas.date_range("2015-01-01", "2015-01-03", freq="2D")
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01, 2015-01-03]
Length: 2, Freq: 2D, Timezone: None
>>> pandas.date_range("2015-01-01", "2015-01-03", freq="4D")
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 4D, Timezone: None
I tried to understand the close parameter but the description is a bit cryptic for me. None of the three values ("right", "left", None) seem to give what I'm after.
I tested the 3 possible closed=
values to observe the different outputs:
>>> pandas.date_range("2015-01-01", "2015-01-03", freq="3D", closed="left")
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 3D, Timezone: None
This seems like the desired result. But in the case where the freq
is longer than the time between start and end, it returns a range with 0 elements.
>>> pandas.date_range("2015-01-01", "2015-01-01", freq="10D", closed="left")
<class 'pandas.tseries.index.DatetimeIndex'>
Length: 0, Freq: 10D, Timezone: None
I'd expect :
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01]
Length: 1, Freq: 10D, Timezone: None
The only time I would see the result being empty is if the freq
has a multiplier of 0 (ie "0D", "0H", "0W"). This already raises an error, so in practice date_range()
should always return at least it's start value. If the end parameter occurs earlier than the end of the frequency, then the range would only contain one timestamp.