I need to do some minimal text cleaning.
The cleaning should remove punctuation and non-alphabetic characters and keep only English text.
I am currently using clean-text, but I am open to any other library.
I have several CSV files, each with a text column.
I used apply with the function below,
but it runs very slowly.
Is there a better (more efficient) way to do this?
from cleantext import clean
import polars as pl

def clean_text(s):
    return clean(s, lower=True, lang='en', no_punct=True)
df.select(pl.col('text').apply(clean_text))
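
To make the intended cleaning rule concrete, here is a minimal sketch of it using only the standard library's re module. It reflects my assumption about what "clean" should mean here (lowercase, keep ASCII letters, collapse whitespace); it is not what clean-text does internally, and the names NON_ALPHA, MULTI_SPACE and clean_text_regex are hypothetical.

```python
import re

# Assumption: "English text" means ASCII letters a-z after lowercasing.
# Anything else (punctuation, digits, accented letters) becomes a space.
NON_ALPHA = re.compile(r"[^a-z\s]")
# Runs of whitespace left behind are collapsed to a single space.
MULTI_SPACE = re.compile(r"\s+")


def clean_text_regex(s: str) -> str:
    """Lowercase, keep only a-z and spaces, normalise whitespace."""
    s = s.lower()
    s = NON_ALPHA.sub(" ", s)
    return MULTI_SPACE.sub(" ", s).strip()
```

Removed characters are replaced with a space rather than deleted so that words separated only by punctuation do not get glued together. The same two patterns could presumably be applied with Polars' own string expressions (str.to_lowercase followed by str.replace_all), which would avoid calling a Python function once per row, but I have not benchmarked that.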