Is there a faster way to clean emojis from a cuDF string series?

I am currently using

emoji == 1.7.0

and retrieving the emoji regex pattern, since cuDF doesn't support the emoji library directly (I can't apply emoji.get_emoji_regexp().sub("", string) to a series):

emoji_pattern = emoji.get_emoji_regexp()

then applying it to the cuDF series:

cu_df.post_content.str.replace(
    emoji_pattern,
    "",
    regex=True
)

This operation, however, is extremely slow: slower than if I had used multiprocessing on CPUs (~45 min on CPUs vs. over 1 hour with cuDF). Does anyone know why, and what I am doing wrong?
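
For anyone who wants to reproduce the operation without the original data, here is a minimal CPU-only sketch on a made-up sample. The names `posts`, `EMOJI_CLASS`, and `strip_emojis` are hypothetical, and the compact character class is an assumption: it covers common emoji code-point ranges and is not equivalent to the full pattern returned by emoji.get_emoji_regexp().

```python
import re

# Hypothetical sample standing in for cu_df.post_content (escapes used for portability)
posts = [
    "good morning \u2600\uFE0F",               # sun emoji + variation selector
    "no emoji here",
    "\U0001F389\U0001F389 party \U0001F389",   # party popper emoji
]

# Assumption: a compact character class over common emoji ranges,
# NOT the full alternation produced by emoji.get_emoji_regexp().
EMOJI_CLASS = (
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)
emoji_re = re.compile(EMOJI_CLASS)


def strip_emojis(texts):
    """Return a new list with every match of the emoji class removed."""
    return [emoji_re.sub("", text) for text in texts]


cleaned = strip_emojis(posts)
print(cleaned)
```

Surrounding whitespace is left in place, so a further strip may be wanted. Whether passing a pattern string like `EMOJI_CLASS` to the cuDF `str.replace(..., regex=True)` call is supported, or any faster, is untested here.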

ZooPanda
  • Do you mind making a minimal reproducible example that includes the data, and better describing your dataset (size, rows/columns, dtypes, etc.)? Most helpers don't have a dataset with emojis lying around to test ;). – TaureanDyerNV Sep 21 '22 at 16:12

0 Answers