
I am trying to produce n-grams of 3 letters, but Spark NGram inserts a white space between each letter. I want to remove (or not produce) this white space. I could explode the array, remove the white space, then reassemble the array, but this would be a very expensive operation. I also want to avoid creating UDFs, due to the performance issues with PySpark UDFs. Is there a cheaper way to do this using the PySpark built-in functions?

from pyspark.ml import Pipeline, Model, PipelineModel
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover, NGram
from pyspark.sql.functions import *


wordDataFrame = spark.createDataFrame([
    (0, "Hello I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic regression models are neat")
], ["id", "words"])

pipeline = Pipeline(stages=[
        RegexTokenizer(pattern="", inputCol="words", outputCol="tokens", minTokenLength=1),
        NGram(n=3, inputCol="tokens", outputCol="ngrams")
    ])

model = pipeline.fit(wordDataFrame).transform(wordDataFrame)

model.show()

The current output is:

+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...|[h, e, l, l, o,  ...|[h e l, e l l, l ...|
+---+--------------------+--------------------+--------------------+

but what is desired is:

+---+--------------------+--------------------+--------------------+
| id|               words|              tokens|              ngrams|
+---+--------------------+--------------------+--------------------+
|  0|Hello I heard abo...|[h, e, l, l, o,  ...|[hel, ell, llo,  ...|
+---+--------------------+--------------------+--------------------+

1 Answer

You can achieve this using the higher order function transform and a regex (Spark 2.4+), assuming the ngrams column is an array of strings.

#sampledataframe
df.show()
+---+----------------+---------------+--------------+
| id|           words|         tokens|        ngrams|
+---+----------------+---------------+--------------+
|  0|Hi I heard about|[h, e, l, l, o]|[h e l, e l l]|
+---+----------------+---------------+--------------+

from pyspark.sql import functions as F
df.withColumn("ngrams", F.expr("""transform(ngrams,x-> regexp_replace(x,"\ ",""))""")).show()

+---+----------------+---------------+----------+
| id|           words|         tokens|    ngrams|
+---+----------------+---------------+----------+
|  0|Hi I heard about|[h, e, l, l, o]|[hel, ell]|
+---+----------------+---------------+----------+
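
If you are on Spark 3.1+, the same thing can be written with the Python transform function instead of an expr() string; a minimal sketch, assuming the same df:

from pyspark.sql import functions as F

# Spark 3.1+ exposes transform() in the Python API, so the lambda can be
# written in Python rather than inside a SQL expression string.
df.withColumn("ngrams", F.transform("ngrams", lambda x: F.regexp_replace(x, " ", ""))).show()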
  • Thank you. It works. On my full dataset, however, expr() takes 24.27 sec against 20.89 sec for explode/regexp_replace/groupby(collect_list). Do you know if, in general, higher order functions are less efficient? – Béatrice Moissinac Mar 02 '20 at 21:22
  • @BéatriceMoissinac That's the first time I have heard of explode then groupby/collect_list (a big shuffle operation) outperforming a higher order function, because that almost never happens. I would still recommend you use the higher order function transform. – murtihash Mar 02 '20 at 21:25
  • It is highly possible that it is due to our cluster's setup. (many people and one hamster's wheel :) – Béatrice Moissinac Mar 02 '20 at 21:27
  • @BéatriceMoissinac That very well could be true. These higher order functions were introduced for the sole purpose of removing the need to explode, transform, groupby(collect_list) on big data. Glad I could help. – murtihash Mar 02 '20 at 21:29
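
For reference, a minimal sketch of the explode/regexp_replace/groupby(collect_list) alternative mentioned in the comments above, assuming the same df with an id column to regroup on (this is the shuffle-heavy variant the higher order function avoids):

from pyspark.sql import functions as F

# Explode the ngrams array, strip the spaces, then regroup per id.
# posexplode keeps each element's position so the original order can be restored.
cleaned = (
    df.select("id", F.posexplode("ngrams").alias("pos", "ngram"))
      .withColumn("ngram", F.regexp_replace("ngram", " ", ""))
      .groupBy("id")
      .agg(F.sort_array(F.collect_list(F.struct("pos", "ngram"))).alias("tmp"))
      .withColumn("ngrams", F.expr("transform(tmp, s -> s.ngram)"))
      .drop("tmp")
)
cleaned.show()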