Getting all NaT values for Pandas calc that searches for Max Date by Group

Question

code

reviewer_map['Max Date of Review'] = reviewer_map.groupby('UserID_NB').agg({'Date of Review': [np.max]})

dtypes

UserID_NB                     object
Technical Director            object
Date of Review        datetime64[ns]
Max Date of Review    datetime64[ns]

output looks like

UserID_NB       Technical Director      Date of Review      Max Date of Review
FRANK1          Frank                   2017-04-20          NaT
JOHN2           John                    2017-04-20          NaT

I'd like to show the date of each review along with the director and id number of the reviewer. I've got some duplicates from reschedules so I'm trying to limit this list to the most recent date value for each USERID_NB. For some reason my max values are coming back as missing.

can you proviide a minimal (copyable) input that shows this behaviour (see [mcve])? — MSeifert, Apr 24 '17 at 15:49
If I'm following this correctly you're asking for a dataset that replicates the problem, correct? — mdl003, Apr 24 '17 at 16:08
Exactly :) Doesn't need to be your dataset, just something that can be used to get a similar (problematic) result. — MSeifert, Apr 24 '17 at 16:13

score 2 · Accepted Answer · edited May 23 '17 at 12:03

Using groupby() in this case returns a data frame, not a column to be added to an existing data frame. That's why you were getting missing values in Max Date of Review.
It also seems you don't really need both Date of Review and Max Date of Review in your final output.
Additionally, 'max' is sufficient as an agg() function, you don't need np.max, nor do you need to wrap it in a list.

You basically already solved your own problem, minus a few syntax issues. It's easy enough to set straight.

First, here's some sample data:

dates = pd.date_range('20170101', periods=6, freq='D')
uid = ['FRANK1','JOHN2','FRANK1','JOHN2','FRANK1','FRANK1']
name = ['Frank','John','Frank','JohnABC','Frank','Frank123']
reviewer_map = pd.DataFrame({'UserID_NB':uid,
                             'Technical Director':name,
                             'Date of Review':dates})

print(reviewer_map)
  Date of Review Technical Director UserID_NB
0     2017-01-01              Frank    FRANK1
1     2017-01-02               John     JOHN2
2     2017-01-03              Frank    FRANK1
3     2017-01-04            JohnABC     JOHN2
4     2017-01-05              Frank    FRANK1
5     2017-01-06           Frank123    FRANK1

If you want to use groupby() to accomplish your goal, this will work:

print(reviewer_map.groupby('UserID_NB', as_index=False)
                  .agg({'Date of Review': 'max'})
                  .rename(columns={'Date of Review':'Max Date of Review'}))

  UserID_NB Max Date of Review
0    FRANK1         2017-01-06
1     JOHN2         2017-01-04

Note that the rename() at the end isn't strictly necessary, it's only for if you actually want to rename the Date of Review column as Max Date of Review.

UPDATE
Per OP comments, here's a version that includes the Technical Director field that matches the max Date of Review. Note that the SQL having syntax can sometimes be mimicked with filter(), but it's not always a straightforwardly analogous operation and often requires multiple groupby operations.

Here I used merge() instead, which matches the Technical Director values from the original data frame with the max date rows of the groupby output. Example data now includes different Technical Director values per UserID_NB, in order to illustrate this update.

df = (reviewer_map.groupby('UserID_NB', as_index=False)
              .agg({'Date of Review': 'max'}))
df.merge(reviewer_map, 
         on=['UserID_NB','Date of Review'], 
         how='left')

Output:

  UserID_NB Date of Review Technical Director
0    FRANK1     2017-01-06           Frank123
1     JOHN2     2017-01-04            JohnABC

This is great, thanks. If I did want Max Date of Review in my output (I only want to keep those entries where the Date of Review = Max Date of Review) how would I do that? — mdl003, Apr 24 '17 at 18:25
You're welcome! Not sure I understand your question here though. `Max Date of Review` *is* the output - the `agg()` here only returns those entries where `Date of Review == Max Date of Review`. See the example output - only the max dates for each `UserID_NB` are returned. — andrew_reece, Apr 24 '17 at 18:28
I get what youre saying. My example omitted that technical director and userID_nb aren't always the same, which is why I was grouping by userID_nb (my bad). my ideal output set would look like: userID_NB Technical Director Max Date of Review — mdl003, Apr 24 '17 at 18:40
if I'd done this in sql itd look like select userid, techdirector, date of review from dataset group by userid_nb having max(date of review) = date of review if that helps — mdl003, Apr 24 '17 at 18:42
yes it does! thanks again. only thing I had to change was the join (switched the df and reviewer_map so reviewer_map was on the left side) — mdl003, Apr 24 '17 at 19:44
Glad I could help. If this answer adequately resolved your issue, please mark it as accepted by clicking the check mark next to the answer. — andrew_reece, Apr 24 '17 at 19:47

Getting all NaT values for Pandas calc that searches for Max Date by Group

1 Answers1