0

I am trying to find common elements in a column of list wrt a reference cell. I could accomplish it with a small dataset but I face two problems. The speed is excruciatingly slow even for 25 rows of sample data (20.7 s ± 52 ms per loop), and I unable to find a faster implementation through map which can use parallelization unlike apply that works on a single thread.

The version that I have working right now is as follows:

>>> import polars as pl
>>> import numpy as np
>>>
>>> df = pl.DataFrame({'animal': ['goat','tiger','goat','tiger','lion','goat','tiger','lion'], 'food': ['grass','rabbit','carrots','deer','zebra','water','water','water']})
>>> dl = df.groupby('animal').agg_list()
>>> dl
shape: (3, 2)
┌────────┬───────────────────────────────┐
│ animal ┆ food                          │
│ ---    ┆ ---                           │
│ str    ┆ list[str]                     │
╞════════╪═══════════════════════════════╡
│ lion   ┆ ["zebra", "water"]            │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tiger  ┆ ["rabbit", "deer", "water"]   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ goat   ┆ ["grass", "carrots", "water"] │
└────────┴───────────────────────────────┘
>>>
>>> refn = dl['food'][1].to_numpy()
>>> dl['food'] = dl['food'].apply(lambda x: np.intersect1d(refn,x.to_numpy()))
>>> dl
shape: (3, 2)
┌────────┬───────────────────────────┐
│ animal ┆ food                      │
│ ---    ┆ ---                       │
│ str    ┆ object                    │
╞════════╪═══════════════════════════╡
│ lion   ┆ ['water']                 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tiger  ┆ ['deer' 'rabbit' 'water'] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ goat   ┆ ['water']                 │
└────────┴───────────────────────────┘
>>>

Any help will be greatly appreciated. TIA.

Quiescent
  • 1,088
  • 7
  • 18

1 Answers1

1

If you are using python/numpy/sets for memberships in polars you are most of the time on the wrong path. This counts actually for using apply in general.

Let's start of where you left. You took the 1st index of column "food" which has dtype List. So that gives us a Series assigned to refn.

dl = df.groupby('animal', maintain_order=True).agg_list()
refn = dl['food'][1]
refn
shape: (3,)
Series: 'food' [str]
[
    "rabbit"
    "deer"
    "water"
]

Option 1 (slower)

This solution does an O(n) search in every elemnt, so this is likely not the fastest.

We simply filter the original df with membership of refn and the aggregate the result again.

(df.filter(pl.col("food").is_in(refn))
   .groupby("animal").agg_list()
)

Option 2 (fastest)

A better option, is using a semi join. A semi join filters the DataFrame by membership in another DataFrame, sounds like something we want!

(df.join(refn.to_frame(), on="food", how="semi")
   .groupby("animal").agg_list()
)
ritchie46
  • 10,405
  • 1
  • 24
  • 43
  • Thank you very much. I am new to Polars and there is a lot to learn. It (2nd approach) works on the example. I am missing some on my data in which the lists have over 10-20K strings. I am looking into it and get back in a while. – Quiescent Jun 19 '22 at 18:03
  • The second code works faster and was what I was looking for. Thank you. – Quiescent Jul 16 '22 at 20:50