3

I have following df:

data = [
    ['a', 'one', 10],
    ['a', 'two', 10],
    ['b', 'one', 13],
    ['b', 'two', 100],
    ['c', 'two', 100],
    ['c', 'one', 100],
    ['d', 'one', 100],
    ['d', 'one', 10
]
df = pd.DataFrame(data, columns=['key1', 'key2', 'key3'])
key1    key2    key3
0   a   one 10
1   a   two 10
2   b   one 13
3   b   two 100
4   c   two 100
5   c   one 100
6   d   one 100
7   d   one 10

I need to group by key1,

  • search for key2 == 'one' (for single occurrence result = OK, for 0 or >1 = NOK),
  • also additionally check if key3 == 10 or 100 then OK, anything else is NOK.

As result I should have OK or NOK. How can I get something like the following as output?

key1  result
a   ok
b   nok
c   ok
d   nok

I found a similar question and answer here but without additional check for key3. What I have so far is the following:

test = df.groupby('key1')['key2'].apply(lambda x:(x=='one').sum()).reset_index(name='result')
test['result'].where(~(test['result'] > 1), other='nok', inplace=True)
test['result'].replace([0, 1], ['nok', 'ok'], inplace = True)

When I run this code, it gives me following output:

key1    result
0   a   ok
1   b   ok
2   c   ok
3   d   nok

How can I add the check for key3?

Joshua Shew
  • 618
  • 2
  • 19
Koke Abeke
  • 107
  • 7

2 Answers2

2

IIUC, you can try:

def fn(x):
    # filter "one"
    x = x[x['key2'] == 'one']

    # there are zero or >1, return "nok"
    if len(x) == 0 or len(x) > 1:
        return 'nok'

    # is the single `key3` 10 or 100? If yes, return "ok"
    if x['key3'].iat[0] in [10, 100]:
        return 'ok'

    # otherwise "nok"
    return 'nok'

out = df.groupby('key1').apply(fn).reset_index(name="result")
print(out)

Prints:

  key1 result
0    a     ok
1    b    nok
2    c     ok
3    d    nok
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • quick question, if I want for example to change 'one' for another value and take it as an input to fn(x, key2_var). How should I call this function inside of apply(fn)? – Koke Abeke Jul 30 '23 at 12:36
  • @KokeAbeke You could call it like `out = df.groupby('key1').apply(lambda x: fn(x, "other value"))` – Andrej Kesely Jul 30 '23 at 13:00
1

One-liner using:

  • groupby('key1')['key2'].value_counts().xs('one', level='key2') == 1) for the check on key2
  • groupby('key1')['key3'].unique() to get key3 values for each key1
  • isin([10, 100]).all(axis=1) to check condition on key3
pd.DataFrame({"result": ((df.groupby("key1")["key2"].value_counts().xs("one", level="key2") == 1) & (pd.DataFrame.from_dict(dict(zip(df.groupby("key1")["key3"].unique().index, df.groupby("key1")["key3"].unique().values)), orient="index").isin([10, 100, np.nan]).all(axis=1))).replace([True, False], ["ok", "nok"]),})

It is ~2x slower than using groupby().apply(). See comparison below:

def fast(): # from @Adrej Kesely (10035985)
    def fn(x):
        # filter 'one'
        x = x[x['key2'] == 'one']

        # there are zero or >1, return 'nok'
        if len(x) == 0 or len(x) > 1:
            return 'nok'

        # is the single `key3` 10 or 100? If yes, return 'ok'
        if x['key3'].iat[0] in [10, 100]:
            return 'ok'

        # otherwise 'nok'
        return 'nok'
    return df.groupby('key1').apply(fn).reset_index(name='result')

%timeit fast()
# 677 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

def slow():
    return pd.DataFrame({"result": ((df.groupby("key1")["key2"].value_counts().xs("one", level="key2") == 1) & (pd.DataFrame.from_dict(dict(zip(df.groupby("key1")["key3"].unique().index, df.groupby("key1")["key3"].unique().values)), orient="index").isin([10, 100, np.nan]).all(axis=1))).replace([True, False], ["ok", "nok"]),})

%timeit slow()
# 1.42 ms ± 36.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Documentation for methods used:

Joshua Shew
  • 618
  • 2
  • 19