Context
I'd like to build a time series with pandas that counts, for each date, the distinct values of an id whose start and end dates bracket that date.
For the sake of legibility, this is a simplified version of the problem.
Data
Let's define the data this way:
import pandas as pd

df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
Objectives
For each customerId, there are several distinct id values.
The final aim is to get, for each date of the period_range and for each customerId, the count of distinct id whose startDate and endDate satisfy the function my_date_predicate.
Simplified definition of my_date_predicate:
# Sentinel meaning "no end date set"
unset_date = pd.to_datetime("1900-01")

def my_date_predicate(date, row):
    # Assumes row.startDate and row.endDate were parsed with pd.to_datetime
    return row.startDate <= date and \
        (row.endDate == unset_date or row.endDate > date)
Expected result
I'd like a time series result like this:
       date customerId  customerCount
0   2000-01          1              2
1   2000-01          2              0
2   2000-02          1              1
3   2000-02          2              0
4   2000-03          1              1
5   2000-03          2              0
6   2000-04          1              2
7   2000-04          2              0
8   2000-05          1              2
9   2000-05          2              1
10  2000-06          1              2
11  2000-06          2              2
12  2000-07          1              1
13  2000-07          2              0
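To make the objective concrete, here is a naive brute-force sketch of the computation, not the solution I'm after. It rests on a few assumptions that are not part of the simplified problem: the string dates are parsed with pd.to_datetime using the format '%Y-%m', each period is compared through its first day, and the names records and result are purely illustrative. Applied literally, the predicate still counts id 2 of customer 2 as active in 2000-07 (its endDate is 2000-08), so the last row produced by this sketch may differ from the table above.

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': ['1', '1', '1', '2', '2'],
    'id': ['1', '2', '3', '1', '2'],
    'startDate': ['2000-01', '2000-01', '2000-04', '2000-05', '2000-06'],
    'endDate': ['2000-08', '2000-02', '2000-07', '2000-07', '2000-08'],
})

# Assumption: parse the year-month strings so they compare with Timestamps.
df['startDate'] = pd.to_datetime(df['startDate'], format='%Y-%m')
df['endDate'] = pd.to_datetime(df['endDate'], format='%Y-%m')

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
unset_date = pd.to_datetime('1900-01')


def my_date_predicate(date, row):
    # Active if started on or before `date` and not yet ended (or end unset).
    return row.startDate <= date and \
        (row.endDate == unset_date or row.endDate > date)


records = []  # hypothetical name: one dict per (date, customerId) pair
for period in period_range:
    date = period.to_timestamp()  # assumption: first day of the month
    for customer_id, group in df.groupby('customerId'):
        # Collect the distinct ids of this customer that pass the predicate.
        active_ids = {
            row.id for row in group.itertuples(index=False)
            if my_date_predicate(date, row)
        }
        records.append({
            'date': str(period),
            'customerId': customer_id,
            'customerCount': len(active_ids),
        })

result = pd.DataFrame(records)
print(result)
```

This loops over every (date, customer, row) combination, so it only serves as a reference for checking a proper vectorized answer on small data.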
Question
How can I use pandas to get such a result?