I'm using Spark NLP version 3.2.3 and trying to tokenize some text. I've used spaCy and other tokenizers that handle contractions such as "they're" by splitting it into "they" and "'re". According to pages 105-107 of this resource, Spark NLP should tokenize that way as well: https://books.google.com/books?id=5DDtDwAAQBAJ&pg=PA106&lpg=PA106&dq=spark+nlp+tokenizer+contractions&source=bl&ots=5bao0SzjQ7&sig=ACfU3U1pklNa8NNElLk_tX48tMKHuFGViA&hl=en&sa=X&ved=2ahUKEwij6abZ29bzAhU0CjQIHaIkAE4Q6AF6BAgUEAM#v=onepage&q=spark%20nlp%20tokenizer%20contractions&f=false
However, when I actually run contractions through Spark NLP tokenization, they are not broken apart. Any ideas what might be going on? I want to use this package for other reasons, so I'd rather not have to switch back and forth between spaCy or NLTK and Spark NLP just for tokenization.
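For comparison, here's the behavior I'm used to from spaCy (a blank English pipeline is enough, since the contraction split comes from the English tokenizer exceptions rather than a trained model):

import spacy

# blank("en") still loads the English tokenizer exceptions,
# which split contractions like "They're" into "They" + "'re"
nlp = spacy.blank("en")
print([t.text for t in nlp("They're fine.")])
# ['They', "'re", 'fine', '.']

And here's my Spark NLP attempt: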
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
spark = sparknlp.start()
data = spark.createDataFrame([["They're fine."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
result = pipeline.transform(data)
result.selectExpr("token.result").show(truncate=False)
+------------------+
|result |
+------------------+
|[They're, fine, .]|
+------------------+
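In case it's relevant: I see that Tokenizer exposes setInfixPatterns / addInfixPattern for custom splitting rules, so my best guess at a workaround is something like the sketch below. The regex and my reading of the group semantics (each capture group becoming its own token) are assumptions on my part, and I don't know whether custom patterns replace or extend the defaults, so punctuation handling might change as a side effect.

# Assumption: each capture group in an infix pattern becomes a separate token,
# so pairing the stem with the contraction suffix should split "They're"
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setInfixPatterns(["([A-Za-z]+)('re|'ve|'ll|'d|'s|'m|n't)"])

pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
pipeline.transform(data).selectExpr("token.result").show(truncate=False)
# hoping for something like: [They, 're, fine, .]

Is this the right knob, or is there a built-in way to get the contraction-splitting behavior the book describes?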