I am trying to produce n-grams of 3 letters, but Spark's NGram
inserts a white space between each letter. I want to remove (or avoid producing) this white space. I could explode the array, remove the white space, and then reassemble the array, but that would be a very expensive operation. I would also prefer to avoid UDFs, given their performance cost in PySpark. Is there a cheaper way to do this using PySpark's built-in functions?
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram
wordDataFrame = spark.createDataFrame([
(0, "Hello I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic regression models are neat")
], ["id", "words"])
pipeline = Pipeline(stages=[
RegexTokenizer(pattern="", inputCol="words", outputCol="tokens", minTokenLength=1),
NGram(n=3, inputCol="tokens", outputCol="ngrams")
])
model = pipeline.fit(wordDataFrame).transform(wordDataFrame)
model.show()
The current output is:
+---+--------------------+--------------------+--------------------+
| id| words| tokens| ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...|[h, e, l, l, o, ...|[h e l, e l l, ...|
+---+--------------------+--------------------+--------------------+
but what is desired is:
+---+--------------------+--------------------+--------------------+
| id| words| tokens| ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...|[h, e, l, l, o, ...|[hel, ell, llo, ...|
+---+--------------------+--------------------+--------------------+