Python Polars regex - remove non-English text, keep numbers, punctuation and emojis

Question

I have Python code for the task:

import re
import string

emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

def clean_text(raw_text):
    reg = re.compile(r'({})|[^a-zA-Z0-9 -{}]'.format(emoji_pat,r"\\".join(list(string.punctuation)))) # line a
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    return shrink_whitespace_reg.sub(' ', result).lower()

I tried to use the same pattern with Polars' polars.internals.series.StringNameSpace.contains, but I got an exception:

ComputeError: regex error: Syntax(

regex parse error:
    ([--☀-⛿✀-➿])|[^a-zA-Z0-9 -!\\"\\#\\$\\%\\&\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\]\\^\\_\\`\\{\\}\\~]
                     ^^
error: unclosed character class
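
A minimal sketch of why the parser complains, assuming the pattern is built exactly as in "line a" above. The separator r"\\" is two backslash characters, which a regex engine reads as one escaped backslash, so the "[" from string.punctuation ends up unescaped inside the character class. Python's re accepts that, but the Rust regex syntax supports nested classes such as [x[^xyz]], so a bare "[" opens a new class and the outer one is never closed. The helper is_escaped is hypothetical and only there for illustration.

```python
import re
import string

# Same separator as in the question: a raw string holding TWO backslash characters.
separator = r"\\"
joined = separator.join(string.punctuation)


def is_escaped(pattern: str, index: int) -> bool:
    """Return True if pattern[index] is preceded by an odd number of backslashes."""
    backslashes = 0
    i = index - 1
    while i >= 0 and pattern[i] == "\\":
        backslashes += 1
        i -= 1
    return backslashes % 2 == 1


# In the joined string, "[" follows an escaped backslash, so the bracket itself is bare.
bracket_is_escaped = is_escaped(joined, joined.index("["))

# re.escape puts a single backslash in front of "[", which escapes it.
escaped = re.escape(string.punctuation)
bracket_is_escaped_after_fix = is_escaped(escaped, escaped.index("["))

print(len(separator), bracket_is_escaped, bracket_is_escaped_after_fix)
# prints: 2 False True
```

Building the punctuation part with re.escape(string.punctuation) instead of the join is one way to get a properly escaped "[".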

Example with Japanese, English and unknown text:

texts = ['水虫対策にはコレが一番ですね','','I love polars!-ã„ã¤ã‚‚ã•ã‚‰ã•ã‚‰.','So good .']
import pandas as pd
df = pd.DataFrame({'text':texts})

d = df.text.apply(clean_text)

expected:

0                    
1                  
2    i love polars! .
3         so good  .
Name: text, dtype: object

Another question:

Is it faster than using re?

Can you update your question with some example data? And note that you can use three backticks and `python` to better format your code: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks — ritchie46, Mar 16 '22 at 10:22
Your regex is incorrect. You can test your regex correctness here: https://rustexp.lpil.uk/ To answer your question on performance, yes it would be a lot faster. You don't run custom python code and currently you are compiling your regex pattern on every function call. — ritchie46, Mar 16 '22 at 12:25
@ritchie46 - Why is it not correct? The results are good, except that I get extra whitespace. — MPA, Mar 16 '22 at 12:27
It gives a regex parser error. Your regex pattern is very large, so I haven't taken the time to find where it goes wrong. — ritchie46, Mar 16 '22 at 13:29
@ritchie46 - Fair enough :). That's the reason I opened the issue. I assume the problem comes from the backslashes when I try to add the punctuation to the regex. — MPA, Mar 16 '22 at 15:38
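
On the point about compiling the pattern on every call: a sketch of the pure-Python function with the regex compiled once at module level. It assumes re.escape(string.punctuation) is an acceptable replacement for the manual backslash join; the constant names are hypothetical.

```python
import re
import string

EMOJI_CLASS = "[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"

# Compiled once at import time instead of inside clean_text on every call.
# Group 1 captures an emoji; the second alternative matches any character to drop.
KEEP_OR_DROP = re.compile(
    "({})|[^a-zA-Z0-9 {}]".format(EMOJI_CLASS, re.escape(string.punctuation))
)
SHRINK_WHITESPACE = re.compile(r"\s{2,}")


def clean_text(raw_text: str) -> str:
    """Pad emojis with spaces, replace other unwanted characters with a space, lowercase."""
    spaced = KEEP_OR_DROP.sub(
        lambda m: " {} ".format(m.group(1)) if m.group(1) else " ", raw_text
    )
    return SHRINK_WHITESPACE.sub(" ", spaced).lower()


print(clean_text("So good       ."))
# prints: so good .
```

Because every dropped character becomes a space, a string made up only of removed characters comes back as a single space rather than an empty string, which would explain the extra whitespace mentioned in the comments.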

score 0 · Accepted Answer · answered Mar 21 '22 at 16:02

import string

import polars as pl

emoji_pat = "[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"

texts = ['水虫対策にはコレが一番ですね','','I |love|  polars!-ã„ã¤ã‚‚ã•ã‚‰ã•ã‚‰.','So good       .']

df = pl.DataFrame(pl.Series("text", texts))

In [78]: df
Out[78]:
shape: (4, 1)
┌─────────────────────────────────────┐
│ text                                │
│ ---                                 │
│ str                                 │
╞═════════════════════════════════════╡
│ 水虫対策にはコレが一番ですね        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│                                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ I |love|  polars!-ã„ã¤ã‚‚ã•ã‚‰ã•... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ So good       .                   │
└─────────────────────────────────────┘

# Add cleaned column (rust regex requires "[" inside [] to be escaped).
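# Raw strings keep "\[" and "\s" from being read as Python string escapes.
# with_columns replaces with_column, which newer Polars releases no longer provide.
df_cleaned = df.with_columns(
    pl.col("text").str.replace_all(
        r"[^a-zA-Z0-9 " + string.punctuation.replace("[", r"\[") + emoji_pat + "]+",
        ""
    ).str.replace_all(
        r"\s{2,}", " "
    ).str.to_lowercase().alias("text_cleaned")
)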

In[79]: df_cleaned
Out[79]:
shape: (4, 2)
┌─────────────────────────────────────┬────────────────────┐
│ text                                ┆ text_cleaned       │
│ ---                                 ┆ ---                │
│ str                                 ┆ str                │
╞═════════════════════════════════════╪════════════════════╡
│ 水虫対策にはコレが一番ですね        ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│                                 ┆                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ I |love|  polars!-ã„ã¤ã‚‚ã•ã‚‰ã•... ┆ i |love| polars!-. │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ So [good]       .                 ┆ so [good]  .     │
└─────────────────────────────────────┴────────────────────┘
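
A hedged variation on the pattern above: instead of escaping only "[", build one flat negated class with re.escape and the bare emoji ranges, so no nested "[...]" appears at all. The sketch below checks the two patterns with Python's re only. Passing the same strings to str.replace_all is an assumption: it relies on the Rust engine accepting the escapes that re.escape emits (they are the characters Rust's regex syntax treats as meta characters, as far as I know). The names DROP_PATTERN, SHRINK_PATTERN and clean_with_re are hypothetical.

```python
import re
import string

# Emoji ranges without surrounding brackets, so they can sit inside one flat class.
EMOJI_RANGES = "\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF"

# Everything that is NOT a letter, digit, space, punctuation mark or emoji.
DROP_PATTERN = "[^a-zA-Z0-9 " + re.escape(string.punctuation) + EMOJI_RANGES + "]+"
SHRINK_PATTERN = r"\s{2,}"


def clean_with_re(text: str) -> str:
    """Mirror the two replace_all calls and the lowercase step with Python's re."""
    dropped = re.sub(DROP_PATTERN, "", text)
    return re.sub(SHRINK_PATTERN, " ", dropped).lower()


print(clean_with_re("So [good]       ."))
# prints: so [good] .
```

Unlike string.punctuation.replace("[", r"\["), where the backslash from string.punctuation ends up escaping the "]" that follows it, re.escape also keeps the backslash itself in the allowed set.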

Please, do you have a reference for your comment? - "# Add cleaned column (rust regex requires "[" inside [] to be escaped)." — MPA, Mar 22 '22 at 17:10
