I have a large DataFrame which I need to alter. For one computation I want to group my data on pborderid and then keep only the rows that have a duplicate value in to_wpadr.
Here is an example of my df:
pborderid pbcarid wsid to_wpadr colli pk_end_time
10 76079450 61838497 hp1 523-369p 1 2016-07-01 00:00:38
11 76079450 61838504 hp1 523-370p 1 2016-07-01 00:00:47
12 76079450 61838110 hp1 523-372p 1 2016-07-01 00:01:05
13 76079450 61838225 hp1 523-372p 2 2016-07-01 00:01:13
14 76079450 61838504 hp1 523-372p 3 2016-07-01 00:01:30
15 76079450 61838497 hp1 523-373p 1 2016-07-01 00:01:45
16 76079450 61838504 hp1 523-377p 1 2016-07-01 00:01:55
17 76079450 61838110 hp1 523-376p 5 2016-07-01 00:02:26
18 76079450 61838225 hp1 523-376p 1 2016-07-01 00:02:33
19 76079450 61838497 hp1 523-376p 6 2016-07-01 00:02:55
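For reference, here is a minimal snippet that rebuilds the sample above (the frame name ot matches the code further down; parse_dates and the index assignment are my own additions, the real data obviously has many more rows and pborderid values):

import io
import pandas as pd

ot = pd.read_csv(io.StringIO("""\
pborderid,pbcarid,wsid,to_wpadr,colli,pk_end_time
76079450,61838497,hp1,523-369p,1,2016-07-01 00:00:38
76079450,61838504,hp1,523-370p,1,2016-07-01 00:00:47
76079450,61838110,hp1,523-372p,1,2016-07-01 00:01:05
76079450,61838225,hp1,523-372p,2,2016-07-01 00:01:13
76079450,61838504,hp1,523-372p,3,2016-07-01 00:01:30
76079450,61838497,hp1,523-373p,1,2016-07-01 00:01:45
76079450,61838504,hp1,523-377p,1,2016-07-01 00:01:55
76079450,61838110,hp1,523-376p,5,2016-07-01 00:02:26
76079450,61838225,hp1,523-376p,1,2016-07-01 00:02:33
76079450,61838497,hp1,523-376p,6,2016-07-01 00:02:55
"""), parse_dates=["pk_end_time"])
ot.index = range(10, 20)  # match the index shown above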
The following code works, but as the number of groups increases the performance degrades significantly, taking up to 20 seconds:

ap = ot.groupby(["pborderid", "to_wpadr"], sort=False).filter(lambda x: len(x) > 1)

I think it should be doable by grouping on pborderid only and then filtering/applying within each group.
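Roughly what I have in mind is the sketch below, though I have not verified that it is actually any faster. duplicated(keep=False) should mark every occurrence of a repeated to_wpadr within a group, and group_keys=False is there only to keep the original index:

# Keep, within each pborderid group, the rows whose to_wpadr occurs more than once
ap = ot.groupby("pborderid", sort=False, group_keys=False).apply(
    lambda g: g[g["to_wpadr"].duplicated(keep=False)]
)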
Does anyone have suggestions on how I can use only one group column, or how I can otherwise improve the performance?
Here is the desired outcome:
pborderid pbcarid wsid to_wpadr colli pk_end_time
12 76079450 61838110 hp1 523-372p 1 2016-07-01 00:01:05
13 76079450 61838225 hp1 523-372p 2 2016-07-01 00:01:13
14 76079450 61838504 hp1 523-372p 3 2016-07-01 00:01:30
17 76079450 61838110 hp1 523-376p 5 2016-07-01 00:02:26
18 76079450 61838225 hp1 523-376p 1 2016-07-01 00:02:33
19 76079450 61838497 hp1 523-376p 6 2016-07-01 00:02:55
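For completeness, any faster alternative should reproduce exactly the rows above; a sanity check along these lines (with ap being the candidate result, sorting both frames by index so a different row order does not cause a false mismatch) should catch differences:

expected = ot.groupby(["pborderid", "to_wpadr"], sort=False).filter(lambda x: len(x) > 1)
assert expected.sort_index().equals(ap.sort_index())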