8

I have a dataframe df which can be created with this:

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)

And looks like this:
   id       date1       date2  score1  score2
0   1  2016-01-01  2016-01-05       5       1
1   1  2016-01-02  2016-01-03       7       3
2   1  2016-01-03  2016-01-05       3       0
3   1  2016-01-04  2016-01-05       2       5
4   2  2016-01-02  2016-01-04       9       2
5   2  2016-01-04  2016-01-05       3      20
6   2  2016-01-03  2016-01-04       8       7
7   2  2016-01-01  2016-01-01       3       7

What I need to do is create a column for each of score1 and score2, which creates two columns which SUM the values of score1 and score2 respectively, based on whether the usedate is between date1 and date2. usedate is created by getting all dates between and including the date1 minimum and the date2 maximum. I used this to create the date range:

drange=pd.date_range(df.date1.min(),df.date2.max())    

The resulting dataframe newdf should look like:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 is between and including date1 and date2, which sum row0(5) and row8(3). On usedate 2016-01-04, score2sum is 35, which is calculated by looking at the rows in df where 2016-01-04 is between and including date1 and date2, which sum row0(1), row3(0), row4(5), row5(2), row6(20), row7(7).

Maybe some kind of groupby, or melt then groupby?

clg4
  • 2,863
  • 6
  • 27
  • 32
  • Do you count rows from 0 or 1? There is a confusion in your second example. And in the first example I don't see that row8 is 3. Back to second example, how come `2016-01-04` is between `2017-05-28 and 2017-09-22`, and between `2015-11-01 and 2015-11-09`? – FatihAkici Jan 04 '18 at 21:42
  • Also, your minimum date (`df.date1.min()`) is `2015-11-01`, how can your `usedate` column start from `2016-01-01`? – FatihAkici Jan 04 '18 at 21:54
  • @Fatih Akici, much apologies. I had recreated the sample dataframe with fewer dates, and forgot to change what I originally had. I have updated the question with the correct sample. – clg4 Jan 04 '18 at 22:41

3 Answers3

6

You can use apply with lambda function:

df['date1'] = pd.to_datetime(df['date1'])

df['date2'] = pd.to_datetime(df['date2'])

df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns = ['score1sum', 'score2sum'])

df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) & 
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)

df1.rename_axis('usedate').reset_index()

Output:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • what is the x.name above referring to? – clg4 Jan 15 '18 at 17:32
  • 1
    x.name is refering to the column the apply is working on. Apply takes each column of a dataframe at a time. Determine which column are you are working you can use x.name to that series name. – Scott Boston Jan 15 '18 at 17:35
  • im confused, can you please apply the actual column name you used to get the output? I assumed your code would apply to both score1 and score2. – clg4 Jan 15 '18 at 17:39
  • I am sorry.... since we are using apply with axis=1, x.name is referring the row index of the dataframe we are working on. So, for row1 x.name will be the first date in the index. Apply works on a column or a row at time time depending on the axis. So, each of those are passed in as 'pd.Series' with the index as the name. – Scott Boston Jan 15 '18 at 17:42
  • No. x.name in this case is referring to the usedate we are building in the dataframe. So when you apply with axis=1. Each row is passed in as a series, the series index are the column headers from the dataframe and the name of the series is the row index of that dataframe. – Scott Boston Jan 15 '18 at 17:50
  • this works, choosing this but want to see if the below is faster, waiting on @Peter Leimbigler to respond to using transform with a series... – clg4 Jan 15 '18 at 18:30
1

Method 1: list comprehensions

This is inelegant, but hey, it works! (EDIT: added a second method below.)

# Convert datetime.date to pandas timestamps for easier comparisons
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# solution
newdf = pd.DataFrame(data=drange, columns=['usedate'])
# for each usedate ud, get all df rows whose dates contain ud,
# then sum the scores of these rows
newdf['score1sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score1'].sum() for ud in drange]
newdf['score2sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score2'].sum() for ud in drange]

# output
newdf
     usedate  score1sum  score2sum
  2016-01-01          8          8
  2016-01-02         21          6
  2016-01-03         32         13
  2016-01-04         30         35
  2016-01-05         13         26

Method 2: a helper function with transform (or apply)

newdf = pd.DataFrame(data=drange, columns=['usedate'])

def sum_scores(d):
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()

# apply works here too, and is about equally fast in my testing
newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)

# newdf is same to above

Timings are comparable

# Jupyter timeit cell magic
%%timeit 
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]

100 loops, best of 3: 10.4 ms per loop

# Jupyter timeit line magic
%timeit newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores) 

100 loops, best of 3: 8.51 ms per loop
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
0

conditional_join from pyjanitor may be helpful in the abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor as jn

drange = pd.DataFrame(drange, columns=['dates'])
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

(drange.conditional_join(df, 
                         ('dates', 'date1', '>='), 
                         ('dates', 'date2', '<='))
.droplevel(0, 1)
.select_columns('dates', 'score*')
.groupby('dates')
.sum()
.add_suffix('num')
) 
            score1num  score2num
dates                           
2016-01-01          8          8
2016-01-02         21          6
2016-01-03         32         13
2016-01-04         30         35
2016-01-05         13         26

sammywemmy
  • 27,093
  • 4
  • 17
  • 31