
I'm using Spark NLP version 3.2.3 and trying to tokenize some text. I've used spaCy and other tokenizers that handle contractions such as "they're" by splitting them into "they" and "'re". According to pages 105-107 of this resource, Spark NLP should tokenize that way as well: https://books.google.com/books?id=5DDtDwAAQBAJ&pg=PA106&lpg=PA106&dq=spark+nlp+tokenizer+contractions&source=bl&ots=5bao0SzjQ7&sig=ACfU3U1pklNa8NNElLk_tX48tMKHuFGViA&hl=en&sa=X&ved=2ahUKEwij6abZ29bzAhU0CjQIHaIkAE4Q6AF6BAgUEAM#v=onepage&q=spark%20nlp%20tokenizer%20contractions&f=false

However, when I actually run some contractions through Spark NLP tokenization, it does not break them apart. Any ideas what might be going on? I want to use this package for other reasons, so I would rather not switch back and forth between spaCy or NLTK and Spark NLP.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()  # start a Spark session with Spark NLP loaded
data = spark.createDataFrame([["They're fine."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
result = pipeline.transform(data)

result.selectExpr("token.result").show(truncate=False)
+------------------+
|result            |
+------------------+
|[They're, fine, .]|
+------------------+

1 Answer


The book is simply not up to date with the default behaviour (and I also wish the documentation itself were more thorough). Take a look at the annotators.Tokenizer interface and its defaults here.

From my understanding, the way to handle contractions as you want is to modify the suffix pattern.

The suffixPattern defaults to ([^\s\w]?)([^\s\w]*)\z (according to the docstring for version 3.2.3), which only peels non-word, non-space characters, such as trailing punctuation, off the end of a token, so a mid-token apostrophe never triggers a split.
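To see why, here is a quick check of that default pattern with Python's re module (Python spells the end-of-string anchor \Z where the Java/Scala regex engine used by Spark NLP spells it \z):

import re

# The default suffix pattern only peels non-word, non-space characters
# (e.g. trailing punctuation) off the end of a candidate token.
default_suffix = re.compile(r"([^\s\w]?)([^\s\w]*)\Z")

print(default_suffix.search("fine.").groups())    # ('.', '') -> "." comes off as a suffix
print(default_suffix.search("They're").groups())  # ('', '')  -> nothing to split

By changing the suffix pattern to ('re)\z (you would need to adapt it to your needs), you can achieve the following: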

import sparknlp
import pandas as pd
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

sql = sparknlp.start()  # SparkSession with Spark NLP loaded

toker = Pipeline(stages=[
    DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document"),
    Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("tokens")
        .setSuffixPattern(r"('re)\z")  # split a trailing 're into its own token
])
toker_m = toker.fit(sql.createDataFrame(pd.DataFrame([{"text": ""}])))  # fit on a dummy row
toker_lm = LightPipeline(toker_m)
toker_lm.fullAnnotate("They're fine.")

which gives:

[{'document': [Annotation(document, 0, 12, They're fine., {})],
  'tokens': [Annotation(token, 0, 3, They, {'sentence': '0'}),
   Annotation(token, 4, 6, 're, {'sentence': '0'}),
   Annotation(token, 8, 12, fine., {'sentence': '0'})]}]
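
One caveat: setting suffixPattern replaces the default, so trailing punctuation is no longer split off (you can see fine. surviving as a single token above). If you want to cover the common English contractions in one go, an alternation over the clitics should work. Below is an untested sketch along those lines; the pattern and the set of clitics are my own choice, not something built into Spark NLP:

import sparknlp
import pandas as pd
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

sql = sparknlp.start()

# Hypothetical pattern covering common English clitics; \z anchors the
# match at the end of the candidate token, as in the default pattern.
contraction_suffixes = r"('re|'ve|'ll|'s|'d|'m|n't)\z"

toker_all = Pipeline(stages=[
    DocumentAssembler().setInputCol("text").setOutputCol("document"),
    Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("tokens")
        .setSuffixPattern(contraction_suffixes)
])
toker_all_m = toker_all.fit(sql.createDataFrame(pd.DataFrame([{"text": ""}])))
LightPipeline(toker_all_m).fullAnnotate("I can't believe they'll do it.")

If you also need the default punctuation splitting, you would have to fold the default groups back into the same pattern.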