3

I have a pandas dataframe school_df that looks like this:

    school_id  date_posted date_completed
0    A          2014-01-01  2014-01-01
1    A          2014-01-01  2014-01-08
2    A          2014-04-29  2014-05-01
3    B          2014-01-01  2014-01-01
4    B          2014-01-20  2014-02-23

Each row represents one project by that school. I'd like to add two columns: for each unique school_id, a count of how many projects were posted before that date and a count of how many projects were completed before that date.

The code below works, but I have ~300,000 unique schools, so it's taking a long time to run. Is there a faster way to get what I am looking for? Thank you for your assistance!

import pandas as pd
groups = school_df.groupby("school_id")
blank_df = pd.DataFrame()
for g, df in groups:
    df['school_previous_projects'] = df.date_posted.map(lambda x: len(df[df.date_posted < x]))
    df['school_previous_completed'] = df.date_posted.map(lambda x: len(df[df.date_completed < x]))
    blank_df = pd.concat([blank_df, df])
cottontail
  • 10,268
  • 18
  • 50
  • 51
Erin
  • 33
  • 2
  • @BobHaffner has a pretty good answer. Thinking outside the box, you could groupby school and set indexes on the date columns one at a time. Then you can use the rolling counts because it will be sorted by date. That will be much faster than using the apply method and checking len for each row. Check out cumcount http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html – Brian Pendleton Sep 06 '15 at 12:31
  • I agree with @BrianPendleton My method might be faster than yours, but there might be a better way. – Bob Haffner Sep 06 '15 at 13:43

2 Answers2

2

Give this a try. Should be faster than your for loop and two maps. Starting with your frame

    school_id  date_posted date_completed
0    A          2014-01-01  2014-01-01
1    A          2014-01-01  2014-01-08
2    A          2014-04-29  2014-05-01
3    B          2014-01-01  2014-01-01
4    B          2014-01-20  2014-02-23

Then a function. getProjectCounts() uses boolean indexing and a simple count()

def getProjectCounts(row, df):
    filter = (df["school_id"] == row["school_id"])  & (df["date_posted"] < row["date_posted"])
    dp_count = df[filter]["date_posted"].count()
    filter = (df["school_id"] == row["school_id"])  & (df["date_completed"] < row["date_completed"])
    dc_count = df[filter]["date_completed"].count()
    return pd.Series([dp_count, dc_count])

then an apply() with the function to go row by row

school_df[["school_previous_projects","school_previous_completed"]] = school_df.apply(lambda x : getProjectCounts(x, school_df),axis=1)


  school_id date_posted date_completed  school_previous_projects  \
0         A  2014-01-01     2014-01-01                         0   
1         A  2014-01-01     2014-01-08                         0   
2         A  2014-04-29     2014-05-01                         2   
3         B  2014-01-01     2014-01-01                         0   
4         B  2014-01-20     2014-02-23                         1   

   school_previous_completed  
0                          0  
1                          1  
2                          2  
3                          0  
4                          1 
Bob Haffner
  • 8,235
  • 1
  • 36
  • 43
1

Here is a version using cumcount (I simplified the dates, but still should work):

import pandas as pd
import io


df = pd.DataFrame({'school_id': ['A', 'A', 'A', 'B', 'B'],
                   'date_posted': pd.date_range('2014-01-01', '2014-01-05'),
                   'date_completed': pd.date_range('2014-01-01', '2014-01-05')})

posted = df.set_index('date_posted').groupby('school_id').cumcount()
comp = df.set_index('date_completed').groupby('school_id').cumcount()

df['posted'] = posted.values
df['comp'] = comp.values

print df

Results in:

  date_completed date_posted school_id  posted  comp 
0     2014-01-01  2014-01-01         A       0     0 
1     2014-01-02  2014-01-02         A       1     1 
2     2014-01-03  2014-01-03         A       2     2 
3     2014-01-04  2014-01-04         B       0     0 
4     2014-01-05  2014-01-05         B       1     1 
Brian Pendleton
  • 839
  • 4
  • 13