Starting with a structured numpy array that has 4 fields, I am trying to return an array containing only the row with the latest date for each ID, with the same 4 fields. I found a solution using itertools.groupby that almost works here:
Numpy Mean Structured Array
The problem is I don't understand how to adapt this when you have 4 fields instead of 2. I want to get the whole 'row' back, but only the rows for the latest dates for each ID. I understand that this kind of thing is simpler using pandas, but this is just a small piece of a larger process, and I can't add pandas as a dependency.
import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'), ('f3', '<i4'),
                       ('f4', '<i4')])
For this example array, my desired output would be:
np.array([(datetime.date(2005, 2, 2), 1, 4, 9),
          (datetime.date(2005, 2, 3), 2, 7, 12)],
         dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])
This is what I've tried:
from itertools import groupby
from operator import itemgetter

latest = np.array([(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
                    ['dt'].argmax()) for k, g in
                   groupby(np.sort(data, order='ID').view(np.recarray),
                           itemgetter('ID'))], dtype=data.dtype)
I get this error:
ValueError: size of tuple must match number of fields.
I think this is because the tuple has 2 fields but the array has 4. When I drop 'f3' and 'f4' from the array it works correctly.
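To make the mismatch concrete, here is a small sketch that rebuilds the example array and evaluates only the inner list comprehension, without the outer np.array call. The name `pairs` is just an illustrative variable, and the np.recarray views are left out because they are not needed to see the shape of the result.

```python
from itertools import groupby
from operator import itemgetter

import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'), ('f3', '<i4'),
                       ('f4', '<i4')])

# Same grouping as in the attempt, but keeping the plain list of tuples.
# Each tuple is (ID, position of the latest date within that ID's group).
pairs = [(k, np.array(list(g), dtype=data.dtype)['dt'].argmax())
         for k, g in groupby(np.sort(data, order='ID'), itemgetter('ID'))]

# Every element has length 2, while data.dtype has 4 fields.
print([len(p) for p in pairs], len(data.dtype.names))
```

So each group contributes an (ID, index) pair rather than a full row, which would explain why the 4-field dtype rejects it.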
How can I get it to return all 4 fields?
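One direction that might work without groupby, as a sketch only: sort by ID and then by date, and keep the last row of each ID group with a boolean mask. The names `s` and `is_last` are illustrative, and this assumes the array is non-empty and that taking the last row after sorting is an acceptable way to break ties on equal dates.

```python
import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'), ('f3', '<i4'),
                       ('f4', '<i4')])

# Sort by ID, then by date within each ID, so the latest row of each ID
# ends up last in its group.
s = np.sort(data, order=['ID', 'dt'])

# True where the following row belongs to a different ID, i.e. at the last
# row of each group; the final row always closes the last group.
# (Assumes data is non-empty.)
is_last = np.append(s['ID'][1:] != s['ID'][:-1], True)

# Boolean indexing keeps whole rows, so all 4 fields and the dtype survive.
latest = s[is_last]
print(latest)
```

Because the mask selects complete records from the sorted array, there is no need to rebuild tuples by hand, which avoids the field-count mismatch entirely.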