
I need to run a minimal cleaning pass over some text.

The cleaning should remove punctuation and non-alphabetical characters and keep only English text.

Currently I am using clean-text, but I am open to anything else.

I have several CSV files, each with a text column.

I used apply, but it runs very slowly.

Is there a better (more efficient) way to do this?

import polars as pl
from cleantext import clean

def clean_text(s):
    return clean(s, lower=True, lang='en', no_punct=True)

# called once per row, which is what makes it slow
df.select(pl.col('text').apply(clean_text))
MPA

1 Answer


We can do some expression kung fu so that we only have to call that Python lambda once.

Bear with me:

splitter = ""
df.select([
    pl.col("my_column").list().arr.join(splitter).apply(lambda x: clean(x, no_punct=True)).str.split(splitter).explode()
])
  • First we turn the column into a list column holding a single list value, with the list() expression.
  • Then we use the join expression in the arr namespace to concatenate the rows with a splitter: some string that does not occur in our data, on which we split again later.
  • Then we call apply, which now receives a single large string. (Note that we still need to convert it to a Python string, so it is not cheap.)
  • And then we split on the splitter and explode to get back a string column of our return values (a full runnable sketch follows below).
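Putting the pieces together, here is a minimal self-contained sketch of the trick. The sample data and the sentinel value are my own placeholders, not from the original; the key constraint is that the sentinel survives the cleaning (lowercase letters only, no punctuation) and never appears in the real data.

import polars as pl
from cleantext import clean

df = pl.DataFrame({"my_column": ["Hello, world!", "Good day...", "Foo; bar?"]})

# Placeholder sentinel: must survive clean() and not occur in the data.
splitter = "zzzsplitzzz"

out = df.select([
    pl.col("my_column")
      .list()                                    # collapse the column into one list value
      .arr.join(splitter)                        # one big string: row1 + sentinel + row2 + ...
      .apply(lambda x: clean(x, no_punct=True))  # a single Python call for the whole column
      .str.split(splitter)                       # cut the cleaned string back into rows
      .explode()                                 # list -> one row per original value
])
print(out)

Note that newer polars releases rename these expressions: .list() became .implode(), the arr namespace became list, and .apply became .map_elements.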

I found this to be ~5x faster locally.

Real performance

If you want optimal performance, you can compile a function that takes a polars Series and operates on the string data in Rust. Here is an example of that: https://github.com/pola-rs/polars/tree/master/examples/python_rust_compiled_function
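A lighter-weight middle ground (my own suggestion, not part of the original answer): if the cleaning really only amounts to lowercasing and dropping non-alphabetic characters, polars' built-in string expressions already run in native code, with no Python callback at all. This will not replicate everything clean-text does (e.g. unicode fixing or transliteration):

import polars as pl

df = pl.DataFrame({"text": ["Hello, World!", "Nice, isn't it?", "123 go"]})

out = df.select(
    pl.col("text")
    .str.to_lowercase()
    # keep only lowercase ASCII letters and spaces; widen the character
    # class if accented letters should survive
    .str.replace_all(r"[^a-z ]", "")
)
print(out)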

ritchie46
  • I don't know why, but it runs slower: on 100K rows a simple apply runs in 7.5 secs and your solution in 10.5 secs. And by the way, it does not return the expected results. – MPA Mar 10 '22 at 07:19