
I need to run a minimal cleaning pass over some text.

The cleaning should remove punctuation and non-alphabetical characters and keep only English text.

Currently I am using clean-text, but I am open to anything else.

I have several CSV files, each with a text column.

I used apply, but it runs very slowly.

Is there a better (more efficient) way to do this?

import polars as pl
from cleantext import clean

def clean_text(s):
    return clean(s, lower=True, lang='en', no_punct=True)

# called once per row, which is what makes it slow
df.select(pl.col('text').apply(clean_text))
MPA

1 Answer


We can do some expression kung fu so that we only have to call that Python lambda once.

Bear with me:

splitter = ""
df.select([
    pl.col("my_column").list().arr.join(splitter).apply(lambda x: clean(x, no_punct=True)).str.split(splitter).explode()
])
  • First we turn the column into a list column holding a single list value, with the list() expression.
  • Then we use the join expression in the arr namespace to concatenate the rows with a splitter: some string that does not occur in our data, on which we split again later.
  • Then we call apply, which now receives a single large string. (Note that we still need to convert it to a Python string, so it is not cheap.)
  • And then we split on the splitter and explode to get back a string column of our return values (a full runnable sketch follows below).
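Putting the pieces together, here is a minimal self-contained sketch of the trick. The sample data and the sentinel value are my own placeholders, not from the original; the key constraint is that the sentinel survives the cleaning (lowercase letters only, no punctuation) and never appears in the real data.

import polars as pl
from cleantext import clean

df = pl.DataFrame({"my_column": ["Hello, world!", "Good day...", "Foo; bar?"]})

# Placeholder sentinel: must survive clean() and not occur in the data.
splitter = "zzzsplitzzz"

out = df.select([
    pl.col("my_column")
      .list()                                    # collapse the column into one list value
      .arr.join(splitter)                        # one big string: row1 + sentinel + row2 + ...
      .apply(lambda x: clean(x, no_punct=True))  # a single Python call for the whole column
      .str.split(splitter)                       # cut the cleaned string back into rows
      .explode()                                 # list -> one row per original value
])
print(out)

Note that newer polars releases rename these expressions: .list() became .implode(), the arr namespace became list, and .apply became .map_elements.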

I found this to be ~5x faster locally.

Real performance

If you want optimal performance, you can compile a function that takes a polars Series and operates on the string data in Rust. Here is an example of that: https://github.com/pola-rs/polars/tree/master/examples/python_rust_compiled_function
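A lighter-weight middle ground (my own suggestion, not part of the original answer): if the cleaning really only amounts to lowercasing and dropping non-alphabetic characters, polars' built-in string expressions already run in native code, with no Python callback at all. This will not replicate everything clean-text does (e.g. unicode fixing or transliteration):

import polars as pl

df = pl.DataFrame({"text": ["Hello, World!", "Nice, isn't it?", "123 go"]})

out = df.select(
    pl.col("text")
    .str.to_lowercase()
    # keep only lowercase ASCII letters and spaces; widen the character
    # class if accented letters should survive
    .str.replace_all(r"[^a-z ]", "")
)
print(out)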

ritchie46
  • I don't know why, but it runs slower: on 100K rows a simple apply runs in 7.5 secs and your solution in 10.5 secs. And by the way, it does not return the expected results. – MPA Mar 10 '22 at 07:19