
I'm using the Spark NLP pipeline to preprocess my data. Instead of only removing punctuation, the normalizer also removes umlauts.

My code:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

documentAssembler = DocumentAssembler() \
    .setInputCol("column") \
    .setOutputCol("column_document") \
    .setCleanupMode("shrink_full")

tokenizer = Tokenizer() \
    .setInputCols(["column_document"]) \
    .setOutputCol("column_token") \
    .setMinLength(2) \
    .setMaxLength(30)

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)

Example:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!

Output:

Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut

Expected Output:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut
jonas

1 Answer


The \w pattern in Java regex (which Spark NLP uses under the hood) is not Unicode-aware by default, so it does not match letters like Ä or ü, and your negated class [^\w -] strips them out. You need to make the pattern Unicode-aware with a regex option; in this case, the easiest way is the embedded flag option (?U):

"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"

More details from the java.util.regex.Pattern documentation:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
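For completeness, here is a minimal end-to-end sketch under the question's setup (assuming an existing SparkSession named spark; the stages are the ones from the question with only the cleanup pattern changed):

from pyspark.ml import Pipeline

# Build a one-row DataFrame with the example sentence from the question.
df = spark.createDataFrame(
    [["Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!"]]
).toDF("column")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])
result = pipeline.fit(df).transform(df)

# Inspect the normalized tokens; the umlauts in Äpfel and Reutter-Müller
# now survive (lowercased, because setLowercase(True) is set).
result.selectExpr("column_normalized.result").show(truncate=False)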

Wiktor Stribiżew