
I have a polars DataFrame, for example:

>>> df = pl.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': ['app', 'nop', 'cap', 'tab']})
>>> df
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪═════╡
│ a   ┆ app │
│ b   ┆ nop │
│ c   ┆ cap │
│ d   ┆ tab │
└─────┴─────┘

I'm trying to get a third column C which is True if the string in column B starts with the string in column A of the same row, and False otherwise. For the example above, I'd expect:

┌─────┬─────┬───────┐
│ A   ┆ B   ┆ C     │
│ --- ┆ --- ┆ ---   │
│ str ┆ str ┆ bool  │
╞═════╪═════╪═══════╡
│ a   ┆ app ┆ true  │
│ b   ┆ nop ┆ false │
│ c   ┆ cap ┆ true  │
│ d   ┆ tab ┆ false │
└─────┴─────┴───────┘
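To spell out the rule row by row, here is a minimal plain-Python sketch (no polars involved). The helper name `starts_with_rowwise` is hypothetical and only illustrates the expected semantics, not how polars would compute it:

```python
def starts_with_rowwise(prefixes, values):
    """For each row i, report whether values[i] starts with prefixes[i]."""
    return [v.startswith(p) for p, v in zip(prefixes, values)]


A = ['a', 'b', 'c', 'd']
B = ['app', 'nop', 'cap', 'tab']

# Each element of C corresponds to one row of the expected column C.
C = starts_with_rowwise(A, B)
print(C)
```

The prefix is taken from column A and the tested string from column B, pairwise per row, so prefixes longer than one character are handled the same way.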

I'm aware of the df['B'].str.starts_with() function, but passing in a column yielded:

>>> df['B'].str.starts_with(pl.col('A'))
...  # Some stuff here.
TypeError: argument 'sub': 'Expr' object cannot be converted to 'PyString'

What's the way to do this? In pandas, you would do:

df.apply(lambda d: d['B'].startswith(d['A']), axis=1)
  • I am just starting to learn polars and there may be other ways, but I think we can compare them in their own slices. `df.with_column( (pl.col('B').str.slice(0,1) == pl.col('A').str.slice(0,1)).alias('bool_') )` – r-beginners Jan 16 '23 at 12:01
  • @r-beginners This is a good start, what I want to do is a little more complicated, hence why I want to use the `starts_with` function since column A could be longer strings – Syafiq Kamarul Azman Jan 16 '23 at 12:29
  • It looks like only a couple of the regex methods in the `.str` namespace [are currently set up to accept expressions.](https://github.com/pola-rs/polars/blob/master/py-polars/polars/internals/expr/string.py#L580) Perhaps this should be filed as a [feature request.](https://github.com/pola-rs/polars/issues) – jqurious Jan 17 '23 at 13:01

3 Answers


Support for passing an expression to str.starts_with was added in polars 0.15.17:

>>> df.with_columns(pl.col("B").str.starts_with(pl.col("A")).alias("C"))
shape: (4, 3)
┌─────┬─────┬───────┐
│ A   ┆ B   ┆ C     │
│ --- ┆ --- ┆ ---   │
│ str ┆ str ┆ bool  │
╞═════╪═════╪═══════╡
│ a   ┆ app ┆ true  │
│ b   ┆ nop ┆ false │
│ c   ┆ cap ┆ true  │
│ d   ┆ tab ┆ false │
└─────┴─────┴───────┘
jqurious

After toying around for a bit, I found that this works, but I'm fairly sure it falls back to Python strings under the hood (judging by the method name startswith) and is therefore not optimized:

>>> pl.concat((df, df.apply(lambda d: d[1].startswith(d[0]))), how='horizontal')
shape: (4, 3)
┌─────┬─────┬───────┐
│ A   ┆ B   ┆ apply │
│ --- ┆ --- ┆ ---   │
│ str ┆ str ┆ bool  │
╞═════╪═════╪═══════╡
│ a   ┆ app ┆ true  │
│ b   ┆ nop ┆ false │
│ c   ┆ cap ┆ true  │
│ d   ┆ tab ┆ false │
└─────┴─────┴───────┘

I'll put up a feature request on Polars to see if this can be improved.


Using struct is another option if polars>=0.13.16. Like the df.apply answer above, however, this approach calls Python's str.startswith rather than polars.Expr.str.starts_with.

Code:

import polars as pl

df = pl.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': ['app', 'nop', 'cap', 'tab']})

df.with_columns(
    pl.struct(['A', 'B']).apply(lambda r: r['B'].startswith(r['A'])).alias('C')
)

Output:

┌─────┬─────┬───────┐
│ A   ┆ B   ┆ C     │
│ --- ┆ --- ┆ ---   │
│ str ┆ str ┆ bool  │
╞═════╪═════╪═══════╡
│ a   ┆ app ┆ true  │
│ b   ┆ nop ┆ false │
│ c   ┆ cap ┆ true  │
│ d   ┆ tab ┆ false │
└─────┴─────┴───────┘

Reference:

How to write polars custom apply function that does the processing row by row?

quasi-human