I have a dataframe df
, which can be created with the following code:
import random
from datetime import timedelta
import pandas as pd
import datetime
#create test range of dates
rng=pd.date_range(datetime.date(2015,7,15),datetime.date(2015,7,31))
rnglist=rng.tolist()
testpts = range(100,121)
#create test dataframe
d={'jid':[i for i in range(100,121)],
'cid':[random.randint(1,2) for _ in testpts],
'ctid':[random.randint(3,4) for _ in testpts],
'stdt':[rnglist[random.randint(0,len(rng))] for _ in testpts]}
df=pd.DataFrame(d)[['jid','cid','ctid','stdt']]
df['enddt'] = df['stdt']+timedelta(days=random.randint(2,16))
The df
looks like this:
jid cid ctid stdt enddt
0 100 1 4 2015-07-28 2015-08-11
1 101 2 3 2015-07-31 2015-08-14
2 102 2 3 2015-07-31 2015-08-14
3 103 1 3 2015-07-24 2015-08-07
4 104 2 4 2015-07-27 2015-08-10
5 105 1 4 2015-07-27 2015-08-10
6 106 2 4 2015-07-24 2015-08-07
7 107 2 3 2015-07-22 2015-08-05
8 108 2 3 2015-07-28 2015-08-11
9 109 1 4 2015-07-20 2015-08-03
10 110 2 3 2015-07-29 2015-08-12
11 111 1 3 2015-07-29 2015-08-12
12 112 1 3 2015-07-27 2015-08-10
13 113 1 3 2015-07-21 2015-08-04
14 114 1 4 2015-07-28 2015-08-11
15 115 2 3 2015-07-28 2015-08-11
16 116 1 3 2015-07-26 2015-08-09
17 117 1 3 2015-07-25 2015-08-08
18 118 2 3 2015-07-26 2015-08-09
19 119 2 3 2015-07-19 2015-08-02
20 120 2 3 2015-07-22 2015-08-05
What I need to do is the following: Count (
cnt
) the number ofjid
that occur byctid
bycid
, for each date(newdate
) between themin(stdt)
andmax(enddt)
, where thenewdate
is between thestdt
and theenddt
.
That resulting DataFrame should look like (this is just for 1 cid
with 1 ctid
using above data)(this would replicate in this case for cid
1/ctid
4, cid
2/ctid
3, cid
2/ctid
4):
cid ctid newdate cnt
1 3 7/21/2015 1
1 3 7/22/2015 1
1 3 7/23/2015 1
1 3 7/24/2015 2
1 3 7/25/2015 3
1 3 7/26/2015 4
1 3 7/27/2015 5
1 3 7/28/2015 5
1 3 7/29/2015 6
1 3 7/30/2015 6
1 3 7/31/2015 6
1 3 8/1/2015 6
1 3 8/2/2015 6
1 3 8/3/2015 6
1 3 8/4/2015 6
1 3 8/5/2015 5
1 3 8/6/2015 5
1 3 8/7/2015 5
1 3 8/8/2015 4
1 3 8/9/2015 3
1 3 8/10/2015 2
1 3 8/11/2015 1
1 3 8/12/2015 1
This previous question (which was also mine) Count # of Rows Between Dates, was very similar, and was answered using pd.melt
. I am pretty sure melt
can be used again, or maybe there is a better option, but I can't figure out how to get the 'two layer groupby' accomplished which counts the size of jid
for each ctid
, for each cid
, for each newdate
. Love your inputs...