1

In a workflow matching up a spec against some allowed values, I wish to find which rows (index) are matching a spec.

This is different from Pandas, isin, column of lists, as I an not matching each row against a (static) list.

I can do it with .explode and .groupby or with .apply, but the first seems very complicated and the second has performance issues. There must be a better way, but which? I can't make .isin broadcast, which otherwise would seem like the best solution.

(Example code also at https://colab.research.google.com/drive/1d8v6n99NPBaSufOsaWe3eRRrMgoSG_rr?usp=sharing)

import pandas as pd
import numpy as np

df = pd.DataFrame(data = {
    'name': ['a', 'b', 'c'],
    'lst': [[0,2,4], [1,2], []],
    'spec': [2,4,0]
})
expect = pd.DataFrame(data= {
    'name': ['a', 'b', 'c'],
    'match_spec': [True, False, False] 
})

def check(f):
  try:
    got = f()
    result = (expect['match_spec'] == got)
    ok = result.all()
    if ok:
      print(f'OK {f}')
    else:
      print(f'FAIL {f}\n{got}')
  except Exception as ex:
    print(f'ERROR {f}\n{ex}')

def naive_broadcast(): return df.spec in df.lst
check(naive_broadcast)

def result_apply(): return df.apply(lambda x: x.spec in x.lst, axis=1)
check(result_apply)

def naive_isin(): return df.spec.isin(df.lst)
check(naive_isin)

def vectorization(): np.vectorize(df.spec.isin)(df.lst.values)
check(vectorization)

# Another ugly way, exploding and grouping
def explode_groupby():
  exp = df.explode('lst')
  return (exp.assign(m = (exp['lst'].eq(exp['spec'])))
    .groupby('name')
    .agg(match_spec=('m', max))
    .reset_index()['match_spec'])
check(explode_groupby)

The above code produces:

ERROR <function naive_broadcast at 0x7fe0a4f6c9d0>
unhashable type: 'Series'
OK <function result_apply at 0x7fe0a4f6ca60>
FAIL <function naive_isin at 0x7fe0a4f6c310>
0    False
1    False
2    False
Name: spec, dtype: bool
ERROR <function vectorization at 0x7fe0a4f6c940>
setting an array element with a sequence.
OK <function explode_groupby at 0x7fe0a4f6ce50>
Helge Jensen
  • 133
  • 10

2 Answers2

0

If you are concerned about performance here's something new

pd.DataFrame([*df['lst']]).eq(df['spec'], axis=0).any(axis=1)

Result

0     True
1    False
2    False
dtype: bool
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • Interesting approach. I am not convinced about the performance though: ``` %timeit result_apply() 518 µs ± 9.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) def explode_eq(): return pd.DataFrame([*df['lst']]).eq(df['spec'], axis=0).any(axis=1) %timeit explode_eq() 746 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ``` – Helge Jensen Feb 08 '23 at 18:43
  • @HelgeJensen Try a dataframe of bigger size for e.g. 100K rows. this solution will outperform others. – Shubham Sharma Feb 09 '23 at 04:21
  • While you are right that your appoach scales nicely with len(df) it does not scale very well with len(np.unique(np.concatenate(df['lst']))). I found a way to "vectorize in", see my own answer -- It was embarassingly simple :) – Helge Jensen Feb 09 '23 at 11:40
0

After some experiments, I found a way to do vectorization "right"™ :)

def vectorize_in(): 
  def isin(spec, list):
    return spec in list
  vec_in = np.vectorize(isin)
  return vec_in(df['spec'], df['lst'])
check(vectorize_in)
OK <function vectorize_in at 0x7fe2de337550>

This approach scales very well, both in len(df), max(len(df.lst)) and len(np.unique(np.concatenate(df['lst']))).

As seen by:

import numpy as np
import pandas as pd
rowcount = 10000

df = pd.DataFrame(data = {"spec": np.arange(0,rowcount)})
rand = np.random.default_rng(seed=42)
lst_mean = len(df)/2
lst_ln = 100
def mklist(x): 
  start = int(np.abs(rand.normal()*lst_mean))
  l = int(np.abs(rand.normal()*lst_ln))
  return np.arange(start-int(l/2), start+l)
df['lst']=df.apply(mklist, axis=1)

def explode_eq(): pd.DataFrame([*df['lst']]).eq(df['spec'], axis=0).any(axis=1)

%timeit result_apply()
234 ms ± 4.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit explode_eq()
1.16 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vectorize_in()
32.4 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Helge Jensen
  • 133
  • 10