Numpy apply function to group in structured array

Question

Starting with a structured numpy array that has 4 fields, I am trying to return an array containing only the rows with the latest date for each ID, keeping all 4 fields. I found a solution using itertools.groupby that almost works in this question: Numpy Mean Structured Array

The problem is I don't understand how to adapt this when you have 4 fields instead of 2. I want to get the whole 'row' back, but only the rows for the latest dates for each ID. I understand that this kind of thing is simpler using pandas, but this is just a small piece of a larger process, and I can't add pandas as a dependency.

data = np.array([('2005-02-01', 1, 3, 8),
             ('2005-02-02', 1, 4, 9),
             ('2005-02-01', 2, 5, 10),
             ('2005-02-02', 2, 6, 11),
             ('2005-02-03', 2, 7, 12)], 
             dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'), ('f3', '<i4'),    
             ('f4', '<i4')])

For this example array, my desired output would be:

np.array([(datetime.date(2005, 2, 2), 1, 4, 9),
          (datetime.date(2005, 2, 3), 2, 7, 12)],
         dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])

This is what I've tried:

latest = np.array([(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
              ['dt'].argmax()) for k, g in 
              groupby(np.sort(data, order='ID').view(np.recarray),
              itemgetter('ID'))], dtype=data.dtype)

I get this error:

ValueError: size of tuple must match number of fields.

I think this is because the tuple has 2 fields but the array has 4. When I drop 'f3' and 'f4' from the array it works correctly.

How can I get it to return all 4 fields?

I would strongly recommend using pandas for this. It would be much easier. — reptilicus, Apr 14 '15 at 23:05
What exactly is your desired output for the example array above? — ali_m, Apr 14 '15 at 23:41
Is it correct that you only want to keep the `'dt'` and `'ID'` fields in the result? — ali_m, Apr 14 '15 at 23:54
`array([(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)], dtype=[('dt', ' — Matt Warren, Apr 14 '15 at 23:56
Your 'latest' expression is too busy. Unpack it and figure out exactly where the error is occurring. — hpaulj, Apr 15 '15 at 02:22

hpaulj · Accepted Answer · 2015-04-15T02:16:23.630

Let's figure out where your error is by peeling off one layer:

In [38]: from operator import itemgetter
In [39]: from itertools import groupby
In [41]: [(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
          ['dt'].argmax()) for k, g in 
          groupby(np.sort(data, order='ID').view(np.recarray),
          itemgetter('ID'))]
Out[41]: [(1, 1), (2, 2)]

What is this list of tuples supposed to represent? It clearly isn't rows from data. And since each tuple has only 2 items, it can't be mapped onto an array with data.dtype, which has 4 fields. Hence the ValueError.
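As a minimal sketch of that mismatch (the names `dtype4`, `dtype2` and `pairs` are made up for illustration), 2-item tuples fit a 2-field dtype but are rejected by a 4-field one. The exact wording of the message may differ between numpy versions, so the sketch only checks that an exception is raised:

```python
import numpy as np

# The 4-field dtype from the question.
dtype4 = [('dt', 'datetime64[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')]
# A hypothetical 2-field dtype, only to show that 2-item tuples fit it.
dtype2 = [('ID', '<i4'), ('idx', '<i4')]

# (ID, argmax within the group), the same values as Out[41].
pairs = [(1, 1), (2, 2)]

# Two items per tuple, two fields: this works.
ok = np.array(pairs, dtype=dtype2)

# Two items per tuple, four fields: numpy refuses.
try:
    np.array(pairs, dtype=dtype4)
    raised = False
except (ValueError, TypeError) as exc:
    raised = True
    print(type(exc).__name__, exc)
```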

After playing around with this a bit, I think [(1, 1), (2, 2)] means: for ID==1, use the [1] item from the group; for ID==2, use the [2] item from the group. That is, the rows you want are:

[(datetime.date(2005, 2, 2), 1, 4, 9),
 (datetime.date(2005, 2, 3), 2, 7, 12)]

You have found the maximum dates, but you have to translate those to either indexes in data, or select those items from the groups.

In [91]: groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))
# don't need recarray

In [92]: G = [(k,list(g)) for k,g in groups]

In [93]: G
Out[93]: 
[(1,
  [(datetime.date(2005, 2, 1), 1, 3, 8),
   (datetime.date(2005, 2, 2), 1, 4, 9)]),
 (2,
  [(datetime.date(2005, 2, 1), 2, 5, 10),
   (datetime.date(2005, 2, 2), 2, 6, 11),
   (datetime.date(2005, 2, 3), 2, 7, 12)])]
In [107]: I=[(1,1), (2,2)]

In [108]: [g[1][i[1]] for g,i in zip(G,I)]
Out[108]: [(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)]

OK, this selection from G is clumsy, but it is a start.

If I define a simple function to pull the record with the latest date from a group, the processing is a lot simpler.

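from itertools import groupby
from operator import itemgetter

def maxdate_record(agroup):
    # collect the group's records into a structured array
    an_array = np.array(list(agroup))
    # position of the latest date within this group
    i = np.argmax(an_array['dt'])
    return an_array[i]

# groupby only merges adjacent equal keys, so sort by ID first
groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))
np.array([maxdate_record(g) for k, g in groups])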

producing:

array([(datetime.date(2005, 2, 2), 1, 4, 9),
       (datetime.date(2005, 2, 3), 2, 7, 12)], 
      dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])

I don't need to specify dtype when I convert a list of records to an array, since the records have their own dtype.
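For comparison, here is a sketch of a different technique that skips itertools altogether: sort by ID and then by date, and keep the last row of each run of equal IDs. This is a sort-and-mask variant, not the groupby approach used above. The names `sorted_data`, `is_last` and `latest` are illustrative, and the sketch assumes data is not empty.

```python
import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

# Sort by ID, then by date within each ID, so the latest date
# ends up as the last row of each ID's run.
sorted_data = np.sort(data, order=['ID', 'dt'])

# A row is the last of its run if the next row has a different ID;
# the final row is always the last of its run (assumes non-empty data).
ids = sorted_data['ID']
is_last = np.append(ids[1:] != ids[:-1], True)

latest = sorted_data[is_last]
print(latest)
```

Boolean indexing keeps whole records, so all 4 fields come back without any per-group Python loop.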

Thanks, the last function is exactly what I was looking for. This is the first time I've used the itertools library, is there a way to 'look under the hood' of a groupby object? For example, when I input the `groups` object, all I get back is ``, which makes it hard to tell if it's doing what I want it to. — Matt Warren, Apr 15 '15 at 16:19
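One way to look inside a groupby object, sketched here with the example data from the question: groupby returns a lazy iterator, so its contents only show up once each group is materialized with list(). The name `materialized` is illustrative. Note that iterating consumes the groupby object, so it has to be recreated before it is used again.

```python
from itertools import groupby
from operator import itemgetter

import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))

# Turn each lazy group into a list right away; a group is no longer
# available once the groupby iterator moves on to the next key.
materialized = [(key, list(group)) for key, group in groups]

for key, rows in materialized:
    print(key, rows)

# The groupby object is now exhausted: iterating again yields nothing.
leftover = list(groups)
```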
