
I have a large df which I need to alter. For one computation I want to group my data on pborderid and then keep only the rows which have a duplicate value in to_wpadr.

Here is an example of my df.

     pborderid   pbcarid wsid  to_wpadr  colli pk_end_time            
10   76079450  61838497  hp1  523-369p      1 2016-07-01 00:00:38     
11   76079450  61838504  hp1  523-370p      1 2016-07-01 00:00:47     
12   76079450  61838110  hp1  523-372p      1 2016-07-01 00:01:05     
13   76079450  61838225  hp1  523-372p      2 2016-07-01 00:01:13      
14   76079450  61838504  hp1  523-372p      3 2016-07-01 00:01:30     
15   76079450  61838497  hp1  523-373p      1 2016-07-01 00:01:45      
16   76079450  61838504  hp1  523-377p      1 2016-07-01 00:01:55      
17   76079450  61838110  hp1  523-376p      5 2016-07-01 00:02:26      
18   76079450  61838225  hp1  523-376p      1 2016-07-01 00:02:33     
19   76079450  61838497  hp1  523-376p      6 2016-07-01 00:02:55  
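
For reproducibility, this sample can be built as follows (values copied from the table above; I'm assuming pk_end_time should be parsed as datetimes):

import pandas as pd

# Sample frame with the same index labels (10-19) as the table above
ot = pd.DataFrame({
    'pborderid': [76079450] * 10,
    'pbcarid': [61838497, 61838504, 61838110, 61838225, 61838504,
                61838497, 61838504, 61838110, 61838225, 61838497],
    'wsid': ['hp1'] * 10,
    'to_wpadr': ['523-369p', '523-370p', '523-372p', '523-372p', '523-372p',
                 '523-373p', '523-377p', '523-376p', '523-376p', '523-376p'],
    'colli': [1, 1, 1, 2, 3, 1, 1, 5, 1, 6],
    'pk_end_time': pd.to_datetime([
        '2016-07-01 00:00:38', '2016-07-01 00:00:47', '2016-07-01 00:01:05',
        '2016-07-01 00:01:13', '2016-07-01 00:01:30', '2016-07-01 00:01:45',
        '2016-07-01 00:01:55', '2016-07-01 00:02:26', '2016-07-01 00:02:33',
        '2016-07-01 00:02:55']),
}, index=range(10, 20))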

The following code works; however, when the number of groups increases, the performance decreases significantly, taking up to 20 seconds. I think it should be doable by using only pborderid as the group and then filter/apply.

ap = ot.groupby(["pborderid", "to_wpadr"], sort=False).filter(lambda x: len(x) > 1)

Does anyone have suggestions for how I can use only one group column, or how I can otherwise improve the performance?

Here is the wanted outcome:

    pborderid   pbcarid wsid  to_wpadr  colli         pk_end_time 
12   76079450  61838110  hp1  523-372p      1 2016-07-01 00:01:05      
13   76079450  61838225  hp1  523-372p      2 2016-07-01 00:01:13        
14   76079450  61838504  hp1  523-372p      3 2016-07-01 00:01:30       
17   76079450  61838110  hp1  523-376p      5 2016-07-01 00:02:26        
18   76079450  61838225  hp1  523-376p      1 2016-07-01 00:02:33        
19   76079450  61838497  hp1  523-376p      6 2016-07-01 00:02:55        
XsjakieX

2 Answers


I don't know if it'll be faster, but you can try using DataFrame.duplicated to keep only the duplicated rows.

ap = ot[ot.duplicated(subset=['pborderid', 'to_wpadr'], keep=False)]

Timing against a 1M-row DataFrame:

In [244]: df = pd.concat([df] * 10**5, ignore_index=True)

In [245]: %timeit df.groupby(["pborderid","to_wpadr"],sort=False).filter(lambda x: len(x) >1)
1 loop, best of 3: 313 ms per loop

In [246]: %timeit df[df.duplicated(subset=['pborderid', 'to_wpadr'], keep=False)]
10 loops, best of 3: 129 ms per loop

In [247]: df.shape
Out[247]: (1000000, 6)
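
Not part of the timings above, but another vectorized option you could sketch out is computing the group sizes with transform and masking on them (sizes is just an illustrative variable name):

# Rows whose (pborderid, to_wpadr) group has more than one member
sizes = ot.groupby(['pborderid', 'to_wpadr'], sort=False)['to_wpadr'].transform('size')
ap = ot[sizes > 1]

Like the duplicated approach, this avoids calling a Python-level lambda once per group, which is likely where the original filter() loses most of its time.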
MaxU - stand with Ukraine

Inspired by this solution, we can replace the groupby operation in this case too. The implementation would look something like this:

import numpy as np

# Create numerical IDs for the relevant columns and a combined one
ID1 = np.unique(df['pborderid'], return_inverse=True)[1]
ID2 = np.unique(df['to_wpadr'], return_inverse=True)[1]
ID = np.column_stack((ID1, ID2))

# Convert each (ID1, ID2) pair to a linear index
lidx = np.ravel_multi_index(ID.T, ID.max(0) + 1)

# Get a unique group ID for each row and the per-group counts
_, ID, count = np.unique(lidx, return_inverse=True, return_counts=True)

# Look for counts > 1 and collect the respective IDs and thus the respective rows of df
df_out = df[np.in1d(ID, np.where(count > 1)[0])]

Sample run -

In [233]: df
Out[233]: 
  pborderid to_wpadr  values
0       ID0     John       3
1       ID1    Erick       8
2       ID1     John       3
3       ID1    Mandy       5
4       ID0    Erick       9
5       ID1    Mandy       4
6       ID0     John       6
7       ID1     John       1

In [234]: df_out
Out[234]: 
  pborderid to_wpadr  values
0       ID0     John       3
2       ID1     John       3
3       ID1    Mandy       5
5       ID1    Mandy       4
6       ID0     John       6
7       ID1     John       1
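
Not part of the original approach above, but the same ID-building step can also be sketched with pd.factorize, which returns the integer codes directly:

import numpy as np
import pandas as pd

# pd.factorize assigns an integer code per distinct value (like np.unique with
# return_inverse, but in order of appearance rather than sorted order)
id1, _ = pd.factorize(df['pborderid'])
id2, _ = pd.factorize(df['to_wpadr'])

# Map each (pborderid, to_wpadr) pair to a single linear index
lidx = np.ravel_multi_index((id1, id2), (id1.max() + 1, id2.max() + 1))

# Keep rows whose pair occurs more than once
_, inv, count = np.unique(lidx, return_inverse=True, return_counts=True)
df_out = df[count[inv] > 1]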

Runtime tests at my end don't seem to suggest any improvement with this approach over the groupby approach listed in the other solution. So, it does seem like df.groupby would be the preferred way to go!

Divakar
  • Thanks Divakar. I thought most pandas functions such as duplicated are already built upon numpy, but I'm still curious whether your method will be faster than the previous one, which was already quite an improvement. The problem with your code is that I need to find unique values based on two columns, so a 2D object. I tried it with view, but a pandas DataFrame does not have this property. So how can I apply your method to find unique rows based on multiple columns? Many thanks in advance! – XsjakieX Jul 07 '16 at 20:24
  • @XsjakieX Yeah, no improvement whatsoever with this approach over the other `groupby` one. So, I would say stick with `groupby`; it was hard to beat that built-in! I fixed my code to incorporate the two-column based grouping. – Divakar Jul 07 '16 at 22:19