1

I've a dataset


id ref name conditionCol
1 123 a no_error
1 456 b error
1 789 c no_error
2 231 d no_error
2 312 e no_error
2 546 f no_error
3 645 g error
3 879 h error
4 789 i no_error
4 978 j no_error

I'm trying to create a custom error_flag, condition being:

  • for each unique id column elements
  • if any row in the conditionCol has the keyword error, then
  • for each row should be flagged as yes in the error_flag
  • if for any element in id column
  • not even a single row has the keyword error in conditionCol column, then
  • for each row should be flagged as no in the error_flag

E.g. For id:1, all the values of error_flag is yes, as for id value 1, row #2 of conditionCol has error


id ref name conditionCol error_flag
1 123 a no_error yes
1 456 b error yes
1 789 c no_error yes

But, for id:2, all the values of error_flag is no, as for id value 2, no row of conditionCol has error


id ref name conditionCol error_flag
2 231 d no_error no
2 312 e no_error no
2 546 f no_error no

Similarly for id value 3 & 4:


id ref name conditionCol error_flag
3 645 g no_error no
3 879 h no_error no
4 789 i error yes
4 978 j error yes

And final output being:


id ref name conditionCol error_flag
1 123 a no_error yes
1 456 b error yes
1 789 c no_error yes
2 231 d no_error no
2 312 e no_error no
2 546 f no_error no
3 645 g no_error no
3 879 h no_error no
4 789 i error yes
4 978 j error yes

Update:


If you wish to play around with the dataset:


import pandas as pd
import numpy as np

id_col = [1,1,1,2,2,2,3,3,4,4]
ref_col = [123,456, 789, 231, 312, 546, 645, 879, 789, 978]
name_col = ['a','b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
conditionCol = ['no_error', 'error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'error', 'error']
df = pd.DataFrame(zip(id_col, ref_col, name_col, conditionCol), columns=['id','ref','name','conditionCol'])
df

update2: Is there a way to work with thresholds, i.e.:

  • current question: atleast one occurrence of keyword error in conditionCol column for each individual unique ids, then the value in error_flag would be yes for all the rows in that id value
  • atleast 4 or atleast 5 occurrence of keyword error in conditionCol column for unique ids, then only the value in error_flag would be yes for all the rows in that id value

1 Answers1

0

Use numpy.where with test if at least one value error per groups by id:

m = df['id'].isin(df.loc[df['conditionCol'].eq('error'), 'id'])
#alternative
#m = df['conditionCol'].eq('error').groupby(df['id']).transform('any')
df['error_flag'] = np.where(m, 'yes', 'no')

print (df)
   id  ref name conditionCol error_flag
0   1  123    a     no_error        yes
1   1  456    b        error        yes
2   1  789    c     no_error        yes
3   2  231    d     no_error         no
4   2  312    e     no_error         no
5   2  546    f     no_error         no
6   3  645    g     no_error         no
7   3  879    h     no_error         no
8   4  789    i        error        yes
9   4  978    j        error        yes
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Trying to expand solution's scope and making it dynamic, is there a way to set a threshold to take decision on yes/no in the `error_flag` column? E.g. currently even a single occurrence of keyword `error` in the `conditionCol` sets all the rows to yes. But if we want to see if the keyword `error` occurs at least x-times, then only the `error_flag` column should have yes, otherwise no – fast_crawler Mar 21 '22 at 11:48
  • @JaspreetSingh Yes, then change `m = df['conditionCol'].eq('error').groupby(df['id']).transform('any')` to `m = df['conditionCol'].eq('error').groupby(df['id']).transform('sum').lt(x)` and `x` is threshold – jezrael Mar 21 '22 at 11:50
  • 1
    Yes, it works (also had to switch yes/no and make it as `df['error_flag'] = np.where(m, 'no', 'yes')` ) but it worked. You gave me a lot of new methods to learn today, thanks :) – fast_crawler Mar 21 '22 at 12:02