Create Pandas TimeSeries from Data, Period-range and aggregation function

Question

Context

I'd like to create a time series (with pandas), to count distinct value of an Id if start and end date are within the considered date.

For sake of legibility, this is a simplified version of the problem.

Data

Let's define the Data this way:

df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})

And the period range this way:

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

Objectives

For each customerId, there are several distinct id. The final aim is to get, for each date of the period-range, for each customerId, the count of distinct id whose start_date and end_date matches the function my_date_predicate.

Simplified definition of my_date_predicate:

unset_date = pd.to_datetime("1900-01")


def my_date_predicate(date, row):
    return row.startDate <= date and \
           (row.endDate.equals(unset_date) or row.endDate > date)

Awaited result

I'd like a time series result like this:

        date customerId customerCount
0   2000-01          1             2
1   2000-01          2             0
2   2000-02          1             1
3   2000-02          2             0
4   2000-03          1             1
5   2000-03          2             0
6   2000-04          1             2
7   2000-04          2             0
8   2000-05          1             2
9   2000-05          2             1
10  2000-06          1             2
11  2000-06          2             2
12  2000-07          1             1
13  2000-07          2             0

Question

How could I use pandas to get such result?

you don't have any records with unset date in your sample data. Would you like to add some so one could see that one's code works well? — Roy2012, Jun 17 '20 at 07:36
Yes, for instance, we can replace the very last `end_date` with `unset_date` as test (and your solution does not work but if it is replaced priorly by a 'max date'). — Bsquare ℬℬ, Jun 17 '20 at 10:10

score 2 · Answer 1 · answered Jun 17 '20 at 09:39

Here's a solution:

df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", closed = "left"), axis=1)
df = df.explode("month")

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})

The result is:

                       count
month      customerId       
2000-01-01 1               2
           2               0
2000-02-01 1               1
           2               0
2000-03-01 1               1
           2               0
2000-04-01 1               2
           2               0
2000-05-01 1               2
           2               1
2000-06-01 1               2
           2               2
2000-07-01 1               1
           2               1

Note:

For unset dates, replace the end date with the very last date you're interested in before you start the calculation.

Thanks for you very interesting answer, it is perfectly working with a little sample. I'm going to challenge it on true big sample today to ensure everything is ok ;) — Bsquare ℬℬ, Jun 17 '20 at 10:09
Thanks, your solution is working nicely, even with more Data. How could I create a nice plot graph => one line per Customer, with count of employees on Y axis, and month on X axis ? — Bsquare ℬℬ, Jun 18 '20 at 16:09
you have various plotting libraries. There's matplotlib which is highly popular. I've used plotly in the past and it worked great. — Roy2012, Jun 18 '20 at 17:17
If it answers the original question, do you mind accepting it as an answer? — Roy2012, Jun 19 '20 at 06:11

score 1 · Answer 2 · answered Jun 17 '20 at 18:39

You can do it with 2 pivot_table to get the count of id per customer in column per start date (and end date) in index. reindex each one with the period_date you are interested in. Substract the pivot for end from the pivot for start. Use cumsum to get the cumulative some of id per customer id. Finally use stack and reset_index to bring to the wanted shape.

#convert to period columns like period_date
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')

#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
print (pvs)
customerId  1  2
2000-01     2  0 #two id for customer 1 that start at this month
2000-02     0  0
2000-03     0  0
2000-04     1  0
2000-05     0  1 #one id for customer 2 that start at this month
2000-06     0  1
2000-07     0  0

Now you can substract one to the other and use cumsum to get the wanted amount per date.

res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
       date customerId  customerCount
0   2000-01          1              2
1   2000-01          2              0
2   2000-02          1              1
3   2000-02          2              0
4   2000-03          1              1
5   2000-03          2              0
6   2000-04          1              2
7   2000-04          2              0
8   2000-05          1              2
9   2000-05          2              1
10  2000-06          1              2
11  2000-06          2              2
12  2000-07          1              1
13  2000-07          2              1

Note really sure how to handle the unset_date as I don't see what is used for

Thanks, should have time to test this today. In the meantime, don't hesitate to upvote the question ;) — Bsquare ℬℬ, Jun 18 '20 at 08:34