
I'm using the Spark NLP pipeline to preprocess my data. Instead of only removing punctuation, the normalizer also removes umlauts.

My code:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

documentAssembler = DocumentAssembler() \
    .setInputCol("column") \
    .setOutputCol("column_document") \
    .setCleanupMode("shrink_full")

tokenizer = Tokenizer() \
    .setInputCols(["column_document"]) \
    .setOutputCol("column_token") \
    .setMinLength(2) \
    .setMaxLength(30)

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)

Example:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!

Output:

Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut

Expected Output:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut
jonas

1 Answer


The \w pattern in Java regex (which Spark NLP uses under the hood) is not Unicode-aware by default, so it does not match letters like Ä or ü, and your negated class [^\w -] strips them out. You need to make the pattern Unicode-aware with a regex option; in this case, the easiest way is the embedded flag option (?U):

"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"

More details from the java.util.regex.Pattern documentation:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
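For completeness, here is a minimal end-to-end sketch under the question's setup (assuming an existing SparkSession named spark; the stages are the ones from the question with only the cleanup pattern changed):

from pyspark.ml import Pipeline

# Build a one-row DataFrame with the example sentence from the question.
df = spark.createDataFrame(
    [["Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!"]]
).toDF("column")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])
result = pipeline.fit(df).transform(df)

# Inspect the normalized tokens; the umlauts in Äpfel and Reutter-Müller
# now survive (lowercased, because setLowercase(True) is set).
result.selectExpr("column_normalized.result").show(truncate=False)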

Wiktor Stribiżew