
I am working with a very large dataframe (3.5 million rows × 150 columns, taking 25 GB of memory when unpickled). I need to find the maximum of one column over each id number and keep only the row with that maximum value; among ties, I need the row with the latest date. Each row is a recorded observation for one id on a certain date.

This is animal test data where there are twenty additional columns seg1–seg20 for each id and date that are filled with test-day information consecutively: the first test fills seg1, the second fills seg2, etc. The "value" field indicates how many segments have been filled, in other words how many tests have been done, so the row with the maximum "value" has the most test data. Ideally I only want these rows and not the previous rows. For example:

df = DataFrame({'id': [1000, 1000, 1001, 2000, 2000, 2000],
                'date': [20010101, 20010201, 20010115, 20010203, 20010223, 20010220],
                'value': [3, 1, 4, 2, 6, 6],
                'seg1': [22, 76, 23, 45, 12, 53],
                'seg2': [23, "", 34, 52, 24, 45],
                'seg3': [90, "", 32, "", 34, 54],
                'seg4': ["", "", 32, "", 43, 12],
                'seg5': ["", "", "", "", 43, 21],
                'seg6': ["", "", "", "", 43, 24]})
df
       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
1  20010201  1000    76                               1
2  20010115  1001    23   34   32   32                4
3  20010203  2000    45   52                          2
4  20010223  2000    12   24   34   43   43   43      6
5  20010220  2000    53   45   54   12   21   24      6

And eventually it should be:

       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
2  20010115  1001    23   34   32   32                4
4  20010223  2000    12   24   34   43   43   43      6

I first tried to use .groupby('id').max() but couldn't find a way to use it to drop rows. The resulting dataframe MUST contain the ORIGINAL ROWS, not just the maximum value of each column for each id. My current solution is:

for i in df.id.unique():
    df = df.drop(df.loc[df.id == i].sort_values(['value', 'date']).index[:-1])

But this takes around 10 seconds per iteration, I assume because it scans the entire dataframe each time through. There are 760,000 unique ids, each 17 digits long, so at this rate it will take far too long to be feasible.

Is there another method that would be more efficient? Currently every column is read in as an "object", and converting the relevant columns to the smallest possible integer dtype doesn't seem to help either.
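A minimal sketch of the kind of downcast I tried (illustrative values, not my real data):

```python
import pandas as pd

# Columns come in as object dtype; downcast to the smallest integer type that fits.
df = pd.DataFrame({'id': [1000, 1000, 1001], 'value': ['3', '1', '4']})
df['value'] = pd.to_numeric(df['value'], downcast='integer')
print(df['value'].dtype)  # int8
```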

jhutch

1 Answer


I tried groupby('id').max() and it works, and it also drops the rows. Did you remember to reassign the df variable? This operation (like almost all pandas operations) is not in-place.

If you do:

df.groupby('id', sort=False).max()

You will get:

          date  value
id                   
1000  20010201      3
1001  20010115      4
2000  20010223      6

And if you don't want id as the index, you do:

df.groupby('id', sort=False, as_index=False).max()

And you will get:

     id      date  value
0  1000  20010201      3
1  1001  20010115      4
2  2000  20010223      6

I don't know if that's going to be much faster, though.
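One caveat (tiny illustrative example, not the question's data): `max()` is computed per column independently, so a row in the result need not correspond to any original row:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1], 'date': [5, 9], 'value': [7, 2]})
result = df.groupby('id', as_index=False).max()
# result has id=1, date=9, value=7 — but no original row had date 9 together with value 7
print(result)
```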

Update

This way the index will not be reset:

df.loc[df.groupby('id').apply(lambda x: x['value'].idxmax())]

And you will get:

       date    id  seg1 seg2 seg3 seg4 seg5 seg6  value
0  20010101  1000    22   23   90                     3
2  20010115  1001    23   34   32   32                4
4  20010223  2000    12   24   34   43   43   43      6
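Alternatively (a sketch, assuming the tie-breaking rule from the question): `idxmax` keeps only the first row among ties, so if you need the latest date when several rows tie on value, you can sort first and keep the last row per id:

```python
import pandas as pd

df = pd.DataFrame({'id': [1000, 1000, 2000, 2000],
                   'date': [20010101, 20010201, 20010223, 20010220],
                   'value': [3, 1, 6, 6]})

# Within each id, the row with the highest value (latest date among ties)
# sorts last; drop_duplicates(keep='last') then retains exactly that row.
result = (df.sort_values(['value', 'date'])
            .drop_duplicates('id', keep='last')
            .sort_index())
# keeps index 0 (id 1000, value 3) and index 2 (id 2000, value 6, date 20010223)
```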
Roberto Soares
  • I'm away from the server where this data is, so I can't confirm if this is fast enough, but I will check it next chance I get. Thanks for the response. – jhutch Jul 10 '15 at 13:11
  • I rewrote the question because I don't think I wrote it well enough the first time. I needed to clarify that I need the **original** row with the max value and not just the maximum value associated with each id. This will just give me the id with the maximum "value." Sorry for the confusion – jhutch Jul 13 '15 at 14:58
  • 1
    worked great and efficiently! idxmax() was the magic function I was looking for. thank you! – jhutch Aug 04 '15 at 18:53
  • Change `iloc` to `loc`, i.e. `df.loc[...]`. See https://stackoverflow.com/questions/44123056/indexerror-positional-indexers-are-out-of-bounds-when-theyre-demonstrably-no – Test May 24 '22 at 09:48