
I would like to analyze statistics per car: which cars were repaired and which are new. A data sample:

Name   IsItNew    ControlDate
Car1    True      31/01/2018
Car2    True      28/02/2018
Car1    False     15/03/2018
Car2    True      16/04/2018
Car3    True      30/04/2018
Car2    False     25/05/2018
Car1    False     30/05/2018    
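
For reference, a minimal sketch to rebuild this sample as a DataFrame (values copied from the table above):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Car1', 'Car2', 'Car1', 'Car2', 'Car3', 'Car2', 'Car1'],
    'IsItNew': [True, True, False, True, True, False, False],
    'ControlDate': ['31/01/2018', '28/02/2018', '15/03/2018', '16/04/2018',
                    '30/04/2018', '25/05/2018', '30/05/2018'],
})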

So I should group by Name, and if there is a False in the IsItNew column, I should set False together with the first date on which False occurred.

I tried groupby with nunique():

df = df.groupby(['Name','IsItNew', 'ControlDate' ])['Name'].nunique()

But it returns the count of unique items in each group.

How can I get only the grouped unique items, without any count?

Actual result is:

Name   IsItNew       ControlDate
Car1    True         31/01/2018     1
        False        15/03/2018     1
                     30/05/2018     1
Car2    True         28/02/2018     1
                     16/04/2018     1
        False        25/05/2018     1 
Car3    True         30/04/2018     1


Expected Result is:

Name   IsItNew     ControlDate
Car1    False      15/03/2018
Car2    False      25/05/2018
Car3    True       30/04/2018

I'd appreciate any ideas. Thanks!

  • If we add the row `Car3 True 01/05/2017`, should the output for `Car3` then be `01/05/2017` or `30/04/2018`? – jezrael Mar 28 '19 at 11:18
  • If we add `Car3` with `True` and `01/05/2017`, then the result for `True` values should be the last date, `30/04/2018`, not the first one as for `False`. So the correct behavior is the last actual date for a `True` value and the first date for a `False` value. – Cindy Mar 28 '19 at 15:52
  • 1
    Please check edited answer. – jezrael Mar 28 '19 at 17:18
  • @jezrael, theoretically the correct behavior would be the `20/10/2019` date for Car3, because it shows the last actual status of Car3. Compared with the current date today (28/03/2019), `20/10/2019` is in the future. Also, `30/04/2017` is the date when Car3 first appears in the dataset, and the next item for Car3 is set under 30/04/2018. I understand that you mean what happens if the ControlDate column is not sorted and contains a future date. Does it make sense? – Cindy Mar 28 '19 at 18:13
  • So the solution works as expected? If all values are True, get the last True; otherwise the first False, per group. – jezrael Mar 28 '19 at 18:19
  • Yes, the suggested solution works as expected. All right, thanks! – Cindy Mar 29 '19 at 08:16
  • Firstly, I accepted the first suggested solution. When I ran it on more than 60K rows, the question of execution time came up. I measured the execution time of the query suggested in your comment and of the second query; the results are ```--- 0.5322470664978027 seconds ---``` for the first and ```--- 68.36492991447449 seconds ---``` for the second. That's why I changed the accepted answer on this question. – Cindy Mar 29 '19 at 16:26
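
For reference, the `--- X seconds ---` output in the comment above matches the usual time.time() timing pattern; a minimal sketch (the measured query itself is a placeholder):

import time

start_time = time.time()
# ... run the groupby/sort query being measured here ...
print("--- %s seconds ---" % (time.time() - start_time))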

2 Answers


One way to do it would be to group by Name and aggregate on IsItNew with two functions: a custom one using any to check whether there are any False values, and idxmin to find the index of the first False, which you can later use to look up ControlDate in the dataframe (written here with named aggregation, available since pandas 0.25):

df_ = (df.groupby('Name')
         .agg(IsItNew=('IsItNew', lambda x: ~(~x).any()),   # False if the group has any False
              ControlDate=('IsItNew', 'idxmin'))            # index of the first False (or first row)
         .reset_index())

# look up the actual dates at the idxmin positions
df_['ControlDate'] = df.loc[df_['ControlDate'].values, 'ControlDate'].reset_index(drop=True)

   Name  IsItNew ControlDate
0  Car1    False  15/03/2018
1  Car2    False  25/05/2018
2  Car3     True  30/04/2018
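
As a side note (not from the original answer): idxmin works here because on a boolean Series False < True, so it returns the label of the first False; if a group is all True, it falls back to the group's first label, which still points at a valid row:

import pandas as pd

s = pd.Series([True, False, True], index=[10, 20, 30])
print(s.idxmin())  # 20 -> label of the first False, since False < True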
yatu
  • I think it's because performance with datetimes is better than taking the min of strings. – jezrael Mar 28 '19 at 10:22
  • Ya, I just think both solutions should be good; it depends on what the OP needs. – jezrael Mar 28 '19 at 10:41
  • @yatu Thank you so much for your solution. It also works as needed, but the execution time for a table with more than 60K rows is ```--- 68.36492991447449 seconds ---```, compared with the previous idea (where the time is ```0.5322470664978027 sec```). – Cindy Mar 29 '19 at 16:28

First convert the column to datetimes with to_datetime (the dates are day-first), then sort by all 3 columns with DataFrame.sort_values, and finally get the first row per Name with DataFrame.drop_duplicates:

import pandas as pd

# dayfirst=True: dates like 15/03/2018 are day/month/year
df['ControlDate'] = pd.to_datetime(df['ControlDate'], dayfirst=True)

df = df.sort_values(['Name','IsItNew', 'ControlDate']).drop_duplicates('Name')

print (df)
   Name  IsItNew ControlDate
2  Car1    False  2018-03-15
5  Car2    False  2018-05-25
4  Car3     True  2018-04-30
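
A small illustration (my addition): booleans sort with False before True, so within each Name the False rows, earliest date first, rise to the top, and drop_duplicates('Name') keeps exactly that row:

import pandas as pd

s = pd.Series([True, False, True])
print(s.sort_values().tolist())  # [False, True, True] -> False sorts before True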

EDIT:

print (df)
   Name  IsItNew ControlDate
0  Car1     True  31/01/2018
1  Car2     True  28/02/2018
2  Car1    False  15/03/2018
3  Car2     True  16/04/2018
4  Car3     True  30/04/2018
5  Car2    False  25/05/2018
6  Car1    False  30/05/2018
7  Car3     True  20/10/2019
8  Car3     True  30/04/2017

# convert to datetimes (day-first format)
df['ControlDate'] = pd.to_datetime(df['ControlDate'], dayfirst=True)
# sort by all 3 columns: within each Name, False rows come first, earliest date first
df1 = df.sort_values(['Name','IsItNew', 'ControlDate'])

# Series for replacement: date of the last sorted row per Name
# (for all-True groups this is the group's latest date)
s = df1.drop_duplicates('Name', keep='last').set_index('Name')['ControlDate']

# first row per Name: the first False if one exists, otherwise the first True
df2 = df1.drop_duplicates('Name').copy()
# for all-True groups, replace the date with the group's last timestamp
df2.loc[df2['IsItNew'], 'ControlDate'] = df2.loc[df2['IsItNew'], 'Name'].map(s)
print (df2)
   Name  IsItNew ControlDate
2  Car1    False  2018-03-15
5  Car2    False  2018-05-25
8  Car3     True  2019-10-20
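
As a readability cross-check (my sketch, not part of the original answer), the same rule can be spelled out with a per-group function, though groupby().apply is the slow path the comments warn about on 60K+ rows:

def pick(g):
    # first False in the group if one exists, otherwise the last (latest) True
    g = g.sort_values('ControlDate')
    falses = g[~g['IsItNew']]
    return falses.iloc[0] if len(falses) else g.iloc[-1]

check = df.groupby('Name').apply(pick).reset_index(drop=True)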
jezrael