How to str.extract_all() in pypolars with REGEX column

Question

I have two columns that are lists of characters. I am trying to extract the characters that are in common between the columns.

It looks like this:

shape: (20, 4)
┌────────────────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┐
│ Original                       ┆ Compartment_1       ┆ Compartment_2       ┆ REGEX               │
│ ---                            ┆ ---                 ┆ ---                 ┆ ---                 │
│ str                            ┆ list[str]           ┆ list[str]           ┆ str                 │
╞════════════════════════════════╪═════════════════════╪═════════════════════╪═════════════════════╡
│ DsPhSBQQQhqmBDhPDsFwjwsLjlRjlt ┆ ["D", "s", ... "F"] ┆ ["w", "j", ... "b"] ┆ wjwsLjlRjlttvjvvtRb │
│ tv...                          ┆                     ┆                     ┆                     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rNJMNNbrHrtjHLHjvwtg           ┆ ["r", "N", ... "r"] ┆ ["t", "j", ... "g"] ┆ tjHLHjvwtg          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ fNbNzZdrZnMnMPnQShFPDmnqFm     ┆ ["f", "N", ... "M"] ┆ ["P", "n", ... "m"] ┆ PnQShFPDmnqFm       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ QWVCFfQffgQCVZzVVpHsHJBqtpspJF ┆ ["Q", "W", ... "V"] ┆ ["p", "H", ... "q"] ┆ pHsHJBqtpspJFRHqq   │
│ RH...                          ┆                     ┆                     ┆                     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                            ┆ ...                 ┆ ...                 ┆ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ZnJHRncHHgnrsrZffTdMdMBfmMvfvR ┆ ["Z", "n", ... "Z"] ┆ ["f", "f", ... "R"] ┆ ffTdMdMBfmMvfvR     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NWWPnZrVHrZPCDDQtzDCPLCq       ┆ ["N", "W", ... "P"] ┆ ["C", "D", ... "q"] ┆ CDDQtzDCPLCq        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jpFjvBZhDFHZdwcmslcslBLLNl     ┆ ["j", "p", ... "d"] ┆ ["w", "c", ... "l"] ┆ wcmslcslBLLNl       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dVtTVVCzzfrrMPNLLcnVcPLRns     ┆ ["d", "V", ... "M"] ┆ ["P", "N", ... "s"] ┆ PNLLcnVcPLRns       │
└────────────────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘

I tried to do:

day3.select(
    pl.col("Compartment_1").str.extract_all(pl.col("Compartment_2"))
)

But since extract_all takes a regex argument, that understandably failed.

So I then turned Compartment_2 into a string column called REGEX, hoping I could pass that in instead, but I keep getting nulls.

I also thought the problem might be that Compartment_1 is a list column, so I tried arr.eval, but that didn't work either:

day3_3 = day3_2.with_columns(
    pl.col("Compartment_1")
    .arr.eval(
        pl.element()
        .str.extract_all(pattern = str(pl.col("REGEX")))
    ).alias("Match")
)

Any tips??

Providing code that builds your dataframe makes it much easier to help. `df = ...` - As for the problem - it looks like you want the intersection of list columns? https://stackoverflow.com/questions/72871905/polars-intersection-of-list-columns-in-dataframe — jqurious, Dec 03 '22 at 19:45
Sorry about that. Yes, their solution there worked. I was looking everywhere and I guess I just didn't type in the right keywords to find that solution. Much appreciated. — Damon C. Roberts, Dec 03 '22 at 19:59
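For reference, the per-row operation discussed in these comments can be sketched in plain Python, without polars. This is only an illustration: the helper names `common_chars` and `common_set` are hypothetical, and the sample strings are the two halves of the second row shown in the question. It also shows why using the second half directly as a regex finds nothing, since the pattern is then treated as one literal string, whereas wrapping it in a character class matches each character individually.

```python
import re


def common_chars(first, second):
    """Characters of `first` that also occur in `second`, in order of appearance."""
    # Wrapping in [...] turns the string into a character class;
    # re.escape guards against characters that are special inside a class.
    pattern = "[" + re.escape(second) + "]"
    return re.findall(pattern, first)


def common_set(first, second):
    """Same idea as a set intersection (duplicates and order are dropped)."""
    return set(first) & set(second)


first, second = "rNJMNNbrHr", "tjHLHjvwtg"

# Used as-is, `second` is a literal pattern and does not occur in `first`.
literal_matches = re.findall(second, first)

char_matches = common_chars(first, second)
shared = common_set(first, second)
print(literal_matches, char_matches, shared)
```

The set version is closer to the list-intersection approach linked above; the regex version is closer to what `extract_all` does.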

score 1 · Answer 1 · answered Dec 08 '22 at 07:34

As of the next polars release (0.15.3), str.extract_all will accept expressions, meaning that you can do:

day3.select(
    pl.col("Compartment_1").str.extract_all(pl.col("Compartment_2"))
)

as you initially thought.

ritchie46
