3

Context

I'd like to create a time series (with pandas), to count distinct value of an Id if start and end date are within the considered date.

For sake of legibility, this is a simplified version of the problem.

Data

Let's define the Data this way:

df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})

And the period range this way:

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

Objectives

For each customerId, there are several distinct id. The final aim is to get, for each date of the period-range, for each customerId, the count of distinct id whose start_date and end_date matches the function my_date_predicate.

Simplified definition of my_date_predicate:

unset_date = pd.to_datetime("1900-01")


def my_date_predicate(date, row):
    return row.startDate <= date and \
           (row.endDate.equals(unset_date) or row.endDate > date)

Awaited result

I'd like a time series result like this:

        date customerId customerCount
0   2000-01          1             2
1   2000-01          2             0
2   2000-02          1             1
3   2000-02          2             0
4   2000-03          1             1
5   2000-03          2             0
6   2000-04          1             2
7   2000-04          2             0
8   2000-05          1             2
9   2000-05          2             1
10  2000-06          1             2
11  2000-06          2             2
12  2000-07          1             1
13  2000-07          2             0

Question

How could I use pandas to get such result?

Bsquare ℬℬ
  • 4,423
  • 11
  • 24
  • 44

2 Answers2

2

Here's a solution:

df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", closed = "left"), axis=1)
df = df.explode("month")

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})

The result is:

                       count
month      customerId       
2000-01-01 1               2
           2               0
2000-02-01 1               1
           2               0
2000-03-01 1               1
           2               0
2000-04-01 1               2
           2               0
2000-05-01 1               2
           2               1
2000-06-01 1               2
           2               2
2000-07-01 1               1
           2               1

Note:

  • For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
Roy2012
  • 11,755
  • 2
  • 22
  • 35
  • Thanks for you very interesting answer, it is perfectly working with a little sample. I'm going to challenge it on true big sample today to ensure everything is ok ;) – Bsquare ℬℬ Jun 17 '20 at 10:09
  • great. Please note my last note about unset dates. – Roy2012 Jun 17 '20 at 10:10
  • Thanks, your solution is working nicely, even with more Data. How could I create a nice plot graph => one line per Customer, with count of employees on Y axis, and month on X axis ? – Bsquare ℬℬ Jun 18 '20 at 16:09
  • you have various plotting libraries. There's matplotlib which is highly popular. I've used plotly in the past and it worked great. – Roy2012 Jun 18 '20 at 17:17
  • If it answers the original question, do you mind accepting it as an answer? – Roy2012 Jun 19 '20 at 06:11
1

You can do it with 2 pivot_table to get the count of id per customer in column per start date (and end date) in index. reindex each one with the period_date you are interested in. Substract the pivot for end from the pivot for start. Use cumsum to get the cumulative some of id per customer id. Finally use stack and reset_index to bring to the wanted shape.

#convert to period columns like period_date
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')

#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
print (pvs)
customerId  1  2
2000-01     2  0 #two id for customer 1 that start at this month
2000-02     0  0
2000-03     0  0
2000-04     1  0
2000-05     0  1 #one id for customer 2 that start at this month
2000-06     0  1
2000-07     0  0

Now you can substract one to the other and use cumsum to get the wanted amount per date.

res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
       date customerId  customerCount
0   2000-01          1              2
1   2000-01          2              0
2   2000-02          1              1
3   2000-02          2              0
4   2000-03          1              1
5   2000-03          2              0
6   2000-04          1              2
7   2000-04          2              0
8   2000-05          1              2
9   2000-05          2              1
10  2000-06          1              2
11  2000-06          2              2
12  2000-07          1              1
13  2000-07          2              1

Note really sure how to handle the unset_date as I don't see what is used for

Ben.T
  • 29,160
  • 6
  • 32
  • 54