In a workflow matching up a spec against some allowed values, I wish to find which rows (index) are matching a spec.
This is different from Pandas, isin, column of lists, as I an not matching each row against a (static) list.
I can do it with .explode
and .groupby
or with .apply
, but the first seems very complicated and the second has performance issues. There must be a better way, but which? I can't make .isin
broadcast, which otherwise would seem like the best solution.
(Example code also at https://colab.research.google.com/drive/1d8v6n99NPBaSufOsaWe3eRRrMgoSG_rr?usp=sharing)
import pandas as pd
import numpy as np
df = pd.DataFrame(data = {
'name': ['a', 'b', 'c'],
'lst': [[0,2,4], [1,2], []],
'spec': [2,4,0]
})
expect = pd.DataFrame(data= {
'name': ['a', 'b', 'c'],
'match_spec': [True, False, False]
})
def check(f):
try:
got = f()
result = (expect['match_spec'] == got)
ok = result.all()
if ok:
print(f'OK {f}')
else:
print(f'FAIL {f}\n{got}')
except Exception as ex:
print(f'ERROR {f}\n{ex}')
def naive_broadcast(): return df.spec in df.lst
check(naive_broadcast)
def result_apply(): return df.apply(lambda x: x.spec in x.lst, axis=1)
check(result_apply)
def naive_isin(): return df.spec.isin(df.lst)
check(naive_isin)
def vectorization(): np.vectorize(df.spec.isin)(df.lst.values)
check(vectorization)
# Another ugly way, exploding and grouping
def explode_groupby():
exp = df.explode('lst')
return (exp.assign(m = (exp['lst'].eq(exp['spec'])))
.groupby('name')
.agg(match_spec=('m', max))
.reset_index()['match_spec'])
check(explode_groupby)
The above code produces:
ERROR <function naive_broadcast at 0x7fe0a4f6c9d0>
unhashable type: 'Series'
OK <function result_apply at 0x7fe0a4f6ca60>
FAIL <function naive_isin at 0x7fe0a4f6c310>
0 False
1 False
2 False
Name: spec, dtype: bool
ERROR <function vectorization at 0x7fe0a4f6c940>
setting an array element with a sequence.
OK <function explode_groupby at 0x7fe0a4f6ce50>