I have a Polars DataFrame with two list[str] columns of single characters. For each row, I am trying to extract the characters that the two columns have in common.
The DataFrame looks like this:
shape: (20, 4)
┌────────────────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┐
│ Original ┆ Compartment_1 ┆ Compartment_2 ┆ REGEX │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ str │
╞════════════════════════════════╪═════════════════════╪═════════════════════╪═════════════════════╡
│ DsPhSBQQQhqmBDhPDsFwjwsLjlRjlt ┆ ["D", "s", ... "F"] ┆ ["w", "j", ... "b"] ┆ wjwsLjlRjlttvjvvtRb │
│ tv... ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ rNJMNNbrHrtjHLHjvwtg ┆ ["r", "N", ... "r"] ┆ ["t", "j", ... "g"] ┆ tjHLHjvwtg │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ fNbNzZdrZnMnMPnQShFPDmnqFm ┆ ["f", "N", ... "M"] ┆ ["P", "n", ... "m"] ┆ PnQShFPDmnqFm │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ QWVCFfQffgQCVZzVVpHsHJBqtpspJF ┆ ["Q", "W", ... "V"] ┆ ["p", "H", ... "q"] ┆ pHsHJBqtpspJFRHqq │
│ RH... ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ZnJHRncHHgnrsrZffTdMdMBfmMvfvR ┆ ["Z", "n", ... "Z"] ┆ ["f", "f", ... "R"] ┆ ffTdMdMBfmMvfvR │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NWWPnZrVHrZPCDDQtzDCPLCq ┆ ["N", "W", ... "P"] ┆ ["C", "D", ... "q"] ┆ CDDQtzDCPLCq │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jpFjvBZhDFHZdwcmslcslBLLNl ┆ ["j", "p", ... "d"] ┆ ["w", "c", ... "l"] ┆ wcmslcslBLLNl │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dVtTVVCzzfrrMPNLLcnVcPLRns ┆ ["d", "V", ... "M"] ┆ ["P", "N", ... "s"] ┆ PNLLcnVcPLRns │
└────────────────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘
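To be concrete about the result I am after, here is a plain-Python sketch of the per-row operation (not Polars code). It assumes each compartment is simply one half of Original, which is what the table above suggests. The name common_chars is made up for this illustration.

```python
def common_chars(first, second):
    """Return the distinct characters present in both lists, sorted."""
    return sorted(set(first) & set(second))


# Second and third rows of the frame above (the ones shown in full).
rows = ["rNJMNNbrHrtjHLHjvwtg", "fNbNzZdrZnMnMPnQShFPDmnqFm"]

matches = []
for original in rows:
    # Assumption: Compartment_1 / Compartment_2 are the two halves of Original.
    half = len(original) // 2
    compartment_1 = list(original[:half])
    compartment_2 = list(original[half:])
    matches.append(common_chars(compartment_1, compartment_2))

print(matches)
```

The comparison is case-sensitive, so "J" and "j" count as different characters. What I want is the equivalent of this set intersection, computed per row on the two list columns.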
I tried to do:
day3.select(
    pl.col("Compartment_1").str.extract_all(pl.col("Compartment_2"))
)
But since extract_all takes a regex argument, that understandably failed.
So then I turned Compartment_2 into a string column called REGEX, hoping I could just pass that through instead, but I keep getting nulls.
I also thought the problem might just be that Compartment_1 is a list column, so I tried using arr.eval, but that still didn't work for me:
day3_3 = day3_2.with_columns(
    pl.col("Compartment_1")
    .arr.eval(
        pl.element()
        .str.extract_all(pattern=str(pl.col("REGEX")))
    ).alias("Match")
)
Any tips?