6

I'm working with polars 0.13.46 for Python and I have a column with a list of Strings for which I need to check if a particular String occurs before another. I have created the following code example that works, but needs to break out of polars using apply, which makes it very slow.

import polars as pl
from polars import col

df = pl.DataFrame(
    {
        'str': ['A', 'B', 'C', 'B', 'A'],
        'group': [1,1,2,1,2]
    }
).lazy()

df_groups = df.groupby('group').agg([col('str').list().alias('str_list')])
print(df_groups.collect())

pre = 'A'
succ = 'B'

df_groups_filtered = df_groups.filter(
    col('str_list').apply(
        lambda str_list: 
            pre in str_list and succ in str_list and 
            str_list.to_list().index(pre) < str_list.to_list().index(succ)
    )
)

df_groups_filtered.collect()

This provides the desired result, which is only the first row of the two rows of the example data:

┌───────┬─────────────────┐
│ group ┆ str_list        │
│ ---   ┆ ---             │
│ i64   ┆ list[str]       │
╞═══════╪═════════════════╡
│ 1     ┆ ["A", "B", "B"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2     ┆ ["C", "A"]      │
└───────┴─────────────────┘

I know that I can do

df_groups_filtered = df_groups.filter(
  col('str_list').arr.contains(pre) & col('str_list').arr.contains(succ)
)

for the part of checking that both strings are contained, but I couldn't figure out how I can check the order in pure polars.

Are there ways to achieve this natively with polars?

4 Answers4

9

One way that we can solve this problem is to use the arr.eval expression. The arr.eval expression allows us to treat a list as if it were a Series/column, so that we can apply all the same expressions we accustomed to using.

The Algorithm

(
    df_groups
    .filter(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
    .collect()
    .filter(pl.col("str_list").arr.contains(succ))
)

shape: (1, 2)
┌───────┬─────────────────┐
│ group ┆ str_list        │
│ ---   ┆ ---             │
│ i64   ┆ list[str]       │
╞═══════╪═════════════════╡
│ 1     ┆ ["A", "B", "B"] │
└───────┴─────────────────┘

Note: there is currently a bug in Polars which causes an error when we use

.filter(pl.col("str_list").arr.contains(succ))

in lazy mode. (I'll file a bug report for that.)

How the algorithm works, in steps

The arr.eval expression allows us to treat a list as a Series/column, so that we can apply our usual toolkit of expressions to our problem.

That said, using arr.eval can seem a bit confusing at first, so we'll walk through this in steps.

As a Series

Let's first see how the algorithm works when our data is a Series/column, and then back into how we code this when the data is in lists.

Let's start with this data. We'll attempt to find any time that cat appears before dog.

df_groups = pl.DataFrame([
        pl.Series('cat_dog', ['aardvark', 'cat', 'mouse', 'dog', 'sloth', 'zebra']),
        pl.Series('dog_cat', ['aardvark', 'dog', 'mouse', 'cat', 'sloth', 'zebra']),
        pl.Series('cat_dog_cat', ['aardvark', 'cat', 'mouse', 'dog', 'monkey', 'cat']),
        pl.Series('dog_cat_dog', ['aardvark', 'dog', 'mouse', 'cat', 'monkey', 'dog']),
        pl.Series('no_dog', ['aardvark', 'cat', 'mouse', 'cat', 'monkey', 'zebra']),
        pl.Series('no_cat', ['aardvark', 'mouse', 'dog', 'monkey', 'dog', 'zebra']),
        pl.Series('neither', ['aardvark', 'mouse', 'tiger', 'zebra', 'sloth', 'zebra']),
])
df_groups
shape: (6, 7)
┌──────────┬──────────┬─────────────┬─────────────┬──────────┬──────────┬──────────┐
│ cat_dog  ┆ dog_cat  ┆ cat_dog_cat ┆ dog_cat_dog ┆ no_dog   ┆ no_cat   ┆ neither  │
│ ---      ┆ ---      ┆ ---         ┆ ---         ┆ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str         ┆ str         ┆ str      ┆ str      ┆ str      │
╞══════════╪══════════╪═════════════╪═════════════╪══════════╪══════════╪══════════╡
│ aardvark ┆ aardvark ┆ aardvark    ┆ aardvark    ┆ aardvark ┆ aardvark ┆ aardvark │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ cat      ┆ dog      ┆ cat         ┆ dog         ┆ cat      ┆ mouse    ┆ mouse    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mouse    ┆ mouse    ┆ mouse       ┆ mouse       ┆ mouse    ┆ dog      ┆ tiger    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ dog      ┆ cat      ┆ dog         ┆ cat         ┆ cat      ┆ monkey   ┆ zebra    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ sloth    ┆ sloth    ┆ monkey      ┆ monkey      ┆ monkey   ┆ dog      ┆ sloth    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ zebra    ┆ zebra    ┆ cat         ┆ dog         ┆ zebra    ┆ zebra    ┆ zebra    │
└──────────┴──────────┴─────────────┴─────────────┴──────────┴──────────┴──────────┘

To detect our first occurrence of a dog, we'll use the cumsum expression on a boolean expression.

pre = "cat"
succ = "dog"
df_groups = df_groups.with_columns(
    (pl.all() == succ).cumsum().suffix('__cumsum')
)
df_groups.select(sorted(df_groups.columns))
shape: (6, 14)
┌──────────┬─────────────────┬─────────────┬─────────────────────┬──────────┬─────────────────┬─────────────┬─────────────────────┬──────────┬─────────────────┬──────────┬────────────────┬──────────┬────────────────┐
│ cat_dog  ┆ cat_dog__cumsum ┆ cat_dog_cat ┆ cat_dog_cat__cumsum ┆ dog_cat  ┆ dog_cat__cumsum ┆ dog_cat_dog ┆ dog_cat_dog__cumsum ┆ neither  ┆ neither__cumsum ┆ no_cat   ┆ no_cat__cumsum ┆ no_dog   ┆ no_dog__cumsum │
│ ---      ┆ ---             ┆ ---         ┆ ---                 ┆ ---      ┆ ---             ┆ ---         ┆ ---                 ┆ ---      ┆ ---             ┆ ---      ┆ ---            ┆ ---      ┆ ---            │
│ str      ┆ u32             ┆ str         ┆ u32                 ┆ str      ┆ u32             ┆ str         ┆ u32                 ┆ str      ┆ u32             ┆ str      ┆ u32            ┆ str      ┆ u32            │
╞══════════╪═════════════════╪═════════════╪═════════════════════╪══════════╪═════════════════╪═════════════╪═════════════════════╪══════════╪═════════════════╪══════════╪════════════════╪══════════╪════════════════╡
│ aardvark ┆ 0               ┆ aardvark    ┆ 0                   ┆ aardvark ┆ 0               ┆ aardvark    ┆ 0                   ┆ aardvark ┆ 0               ┆ aardvark ┆ 0              ┆ aardvark ┆ 0              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat      ┆ 0               ┆ cat         ┆ 0                   ┆ dog      ┆ 1               ┆ dog         ┆ 1                   ┆ mouse    ┆ 0               ┆ mouse    ┆ 0              ┆ cat      ┆ 0              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mouse    ┆ 0               ┆ mouse       ┆ 0                   ┆ mouse    ┆ 1               ┆ mouse       ┆ 1                   ┆ tiger    ┆ 0               ┆ dog      ┆ 1              ┆ mouse    ┆ 0              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog      ┆ 1               ┆ dog         ┆ 1                   ┆ cat      ┆ 1               ┆ cat         ┆ 1                   ┆ zebra    ┆ 0               ┆ monkey   ┆ 1              ┆ cat      ┆ 0              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ sloth    ┆ 1               ┆ monkey      ┆ 1                   ┆ sloth    ┆ 1               ┆ monkey      ┆ 1                   ┆ sloth    ┆ 0               ┆ dog      ┆ 2              ┆ monkey   ┆ 0              │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ zebra    ┆ 1               ┆ cat         ┆ 1                   ┆ zebra    ┆ 1               ┆ dog         ┆ 2                   ┆ zebra    ┆ 0               ┆ zebra    ┆ 2              ┆ zebra    ┆ 0              │
└──────────┴─────────────────┴─────────────┴─────────────────────┴──────────┴─────────────────┴─────────────┴─────────────────────┴──────────┴─────────────────┴──────────┴────────────────┴──────────┴────────────────┘

So our goal will be to see if there are any rows where the value is cat and where the cumsum on the dog expression is also zero.

Using lists and arr.eval

We'll take this in steps.

First, let's create some data as lists (instead of Series). I'm also going to eliminate the lazy mode, to reduce the clutter.

df_groups = pl.DataFrame({
    'group': ['cat_dog', 'dog_cat', 'cat_dog_cat', 'dog_cat_dog', 'no_dog', 'no_cat', 'neither'],
    'str_list': [
        ['aardvark', 'cat', 'mouse', 'dog'],
        ['aardvark', 'dog', 'mouse', 'cat'],
        ['aardvark', 'cat', 'mouse', 'dog', 'monkey', 'cat'],
        ['aardvark', 'dog', 'mouse', 'cat', 'monkey', 'dog'],
        ['aardvark', 'cat', 'mouse', 'cat', 'monkey'],
        ['aardvark', 'mouse', 'dog', 'monkey', 'dog'],
        ['aardvark', 'mouse', 'tiger', 'zebra'],
    ]
})
df_groups
shape: (7, 2)
┌─────────────┬─────────────────────────────────────┐
│ group       ┆ str_list                            │
│ ---         ┆ ---                                 │
│ str         ┆ list[str]                           │
╞═════════════╪═════════════════════════════════════╡
│ cat_dog     ┆ ["aardvark", "cat", ... "dog"]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat     ┆ ["aardvark", "dog", ... "cat"]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["aardvark", "cat", ... "cat"]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat_dog ┆ ["aardvark", "dog", ... "dog"]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_dog      ┆ ["aardvark", "cat", ... "monkey"... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_cat      ┆ ["aardvark", "mouse", ... "dog"]    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ neither     ┆ ["aardvark", "mouse", ... "zebra... │
└─────────────┴─────────────────────────────────────┘

Now, let's look for any list where cat appears before a dog (i.e., the cumsum value on dog is zero).

pre = "cat"
succ = "dog"
(
    df_groups
    .with_columns(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
    )
)
shape: (7, 2)
┌─────────────┬────────────────┐
│ group       ┆ str_list       │
│ ---         ┆ ---            │
│ str         ┆ list[str]      │
╞═════════════╪════════════════╡
│ cat_dog     ┆ ["cat"]        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat     ┆ []             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["cat"]        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat_dog ┆ []             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_dog      ┆ ["cat", "cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_cat      ┆ []             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ neither     ┆ []             │
└─────────────┴────────────────┘

We see that only three lists have a cat before a dog appears.

Next, we'll change the with_columns to a filter to keep only those rows where we found one or more cat before a dog.

(
    df_groups
    .filter(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
)
shape: (3, 2)
┌─────────────┬─────────────────────────────────────┐
│ group       ┆ str_list                            │
│ ---         ┆ ---                                 │
│ str         ┆ list[str]                           │
╞═════════════╪═════════════════════════════════════╡
│ cat_dog     ┆ ["aardvark", "cat", ... "dog"]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["aardvark", "cat", ... "cat"]      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_dog      ┆ ["aardvark", "cat", ... "monkey"... │
└─────────────┴─────────────────────────────────────┘

And finally, we need to eliminate any rows where a dog never appears.

(
    df_groups
    .filter(pl.col("str_list").arr.contains(succ))
    .filter(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
)
shape: (2, 2)
┌─────────────┬────────────────────────────────┐
│ group       ┆ str_list                       │
│ ---         ┆ ---                            │
│ str         ┆ list[str]                      │
╞═════════════╪════════════════════════════════╡
│ cat_dog     ┆ ["aardvark", "cat", ... "dog"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["aardvark", "cat", ... "cat"] │
└─────────────┴────────────────────────────────┘
  • Thanks for this fast and elegant solution as well as the very detailed explanation, @cbilot! – Thomas Lutterbeck Jun 23 '22 at 12:29
  • I should probably mention: the `arr.eval` has a `parallel` keyword. You might try setting this to `parallel=True` to see if it yields faster performance. –  Jun 23 '22 at 13:02
  • Good point. For the smaller data set that I'm currently operating on it doesn't make it difference. I assume it will make a difference once the data grows. – Thomas Lutterbeck Jun 24 '22 at 04:39
2

It's not the most easy code to process, but this should work reasonably well. I will see we can add an arg_where expression.

If we want to do this fast, we have to create a temporary dummy variable for now (Until we eliminate subexpressions) in polars.

For every element in the list, we filter the values that are not pre or succ (Or more values, I made it generic to work with n number of values).

This should leave these lists:

shape: (2,)
Series: 'dummy' [list]
[
    ["A"]
    ["A", "B"]
]

Then in the filter operation we use a fold to create a boolean predicate unfolded as:

dummy[0] == order[0] & dummy[1] == order[1]

We could write this out, which would be a bit more readable, but then it would not work on n number of elements.

pre = 'A'
succ = 'B'

order = [pre, succ]


# we first compute a dummy 
# we do that sequentally because it can be expensive to compute it multiple times.
df_groups.with_columns([
    
    # we use arr.eval
    # and run the search in parallel
    pl.col("str_list").arr.eval(
        expr=pl.element().filter(pl.element().is_in(order)).head(2),
        parallel=True
    ).alias("dummy"),
]).filter(
    # we use a fold because this generic for any number of elements
    pl.fold(acc=True, 
           f=lambda acc, e: acc & e, 
           exprs=[pl.col("dummy").arr.get(i) == order[i] for i in range(0, len(order))]
           )
).drop("dummy")
ritchie46
  • 10,405
  • 1
  • 24
  • 43
  • Thanks Ritchie! I had to change the expression to only work on unique elements to also support cases like ["A", "A", "B"]. It's definitely faster than the `apply` version, but slightly slower that cbilot's suggestion. So I will go with his solution. – Thomas Lutterbeck Jun 23 '22 at 12:27
  • There’s a new `arg_where` function that may make this even faster. You may want to check that out in upcoming releases. https://github.com/pola-rs/polars/pull/3757 –  Jun 23 '22 at 12:58
1

Here's another way to do it that uses explode, pivot, and join as the primary strategy.

If we start from the same cat/dog df as @user18559875 then we get

df_groups = pl.DataFrame({
    'group': ['cat_dog', 'dog_cat', 'cat_dog_cat', 'dog_cat_dog', 'no_dog', 'no_cat', 'neither'],
    'str_list': [
        ['aardvark', 'cat', 'mouse', 'dog'],
        ['aardvark', 'dog', 'mouse', 'cat'],
        ['aardvark', 'cat', 'mouse', 'dog', 'monkey', 'cat'],
        ['aardvark', 'dog', 'mouse', 'cat', 'monkey', 'dog'],
        ['aardvark', 'cat', 'mouse', 'cat', 'monkey'],
        ['aardvark', 'mouse', 'dog', 'monkey', 'dog'],
        ['aardvark', 'mouse', 'tiger', 'zebra'],
    ]
})

Taking this step by step, first we want an id column so we use with_row_count

df_groups.with_row_count('id')

We then want to use explode to convert the str_list into a bigger df where each value of that nested list becomes its own row

df_groups.with_row_count('id').explode('str_list')

from here we add another index to capture the ordering of the list

df_groups.with_row_count('id').explode('str_list').with_row_count('order')

Now we do a groupby on our first id and our str_list column which we'll then aggregate by the minimum value of order (ie, get the index value of the underlying list)

df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
         .groupby(['id','str_list']).agg(pl.col('order').min())

Since we only care about the relative position of dogs and cats, we filter to only get dogs and cats, which we could/should actually do as early as possible which would be right after the explode (I'm putting it after the with_row_count in case the true ordinal position matters.

df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
         .filter(pl.col('str_list').is_in(['cat','dog'])) \
         .groupby(['id','str_list']).agg(pl.col('order').min())

The next thing we want is to filter for when cat is less than dog so we pivot so there's a dog and a cat column from which we can simply filter for when cat<dog and/or when dog isn't present.

df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
    .filter(pl.col('str_list').is_in(['cat','dog'])) \
    .groupby(['id','str_list']).agg(pl.col('order').min()) \
    .pivot(index='id', columns='str_list', values='order') \
    .filter((pl.col('cat')<pl.col('dog')) | (pl.col('dog').is_null()))

but that's not what you actually want, it just gives you the row of what you want so we just select the id column and join it back to the original.

df_groups.with_row_count('id').join(
    df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
        .filter(pl.col('str_list').is_in(['cat','dog'])) \
        .groupby(['id','str_list']).agg(pl.col('order').min()) \
        .pivot(index='id', columns='str_list', values='order') \
        .filter((pl.col('cat')<pl.col('dog')) | (pl.col('dog').is_null())).select('id'),
    on='id').select(pl.exclude('id'))
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
1

I'd like to add another solution that I actually ended up adopting. Credits go to @mcrumiller, who posted this on Github.

import polars as pl
from polars import col, when

df = pl.DataFrame({
    "id": [1,2,3,4],
    "list": [['A', 'B'], ['B', 'A'], ['C', 'B', 'D'], ['D', 'A', 'C', 'B']]
})

def loc_of(value):
    # only execute if the item is contained in the list
    return when(col("list").arr.contains(value)).then(
        col("list").arr.eval(
            # create array of True/False, then cast to 1's and 0's
            # arg_max() then finds the first occurrence of 1, i.e. the first occurrence of value
            (pl.element() == value).cast(pl.UInt8).arg_max(),
            parallel=True
        ).arr.first()
    ).otherwise(None) # return null if not found

df.filter(loc_of('A') < loc_of('B'))

I really like the simplicity of this approach. Performance-wise it is very similar to the approach of @user18559875.