I'm a newbie in PySpark, and I want to translate the following NLP-based feature code, written in plain Python, into PySpark.
#Python
import pandas as pd

N = 2
# count the distinct character bigrams in a string (0 for missing values)
n_grams = lambda input_text: 0 if pd.isna(input_text) else len(set(input_text[character_index:character_index+N] for character_index in range(len(input_text)-N+1)))
#quick test
n_grams_example = 'zhang1997' # bigrams = ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']
n_grams(n_grams_example) # 8
I checked the NGram Python docs and tried the following, unsuccessfully:
#PySpark
from pyspark.ml.feature import NGram

ndf = spark.createDataFrame([
    (0, ["zhang1997"])], ["id", "words"])
ndf.show()
+---+-----------+
| id| words|
+---+-----------+
| 0|[zhang1997]|
+---+-----------+
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(ndf)
ngramDataFrame.select("ngrams").show(truncate=False)
+------+
|ngrams|
+------+
|[] |
+------+
Am I missing something here? I get an empty [] as a result instead of ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']. I'm interested in getting the length of the n-gram set, which is 8 in this case.
Update: I found a way to do this without using NGram, but I'm not happy with its performance.
def n_grams(input_text):
    # count the distinct character bigrams in a string (0 for null values)
    if input_text is None:
        return 0
    N = 2
    return len(set(input_text[character_index:character_index+N] for character_index in range(len(input_text)-N+1)))
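Roughly, this is a minimal sketch of how I apply that function to a DataFrame as a plain Python UDF (the column names text and ngram_count are just placeholders for this example):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# wrap the plain Python function as a UDF; every row is serialized to a Python
# worker, which is why the performance is disappointing on large DataFrames
n_grams_udf = F.udf(n_grams, IntegerType())

df = spark.createDataFrame([(0, "zhang1997"), (1, None)], ["id", "text"])
df = df.withColumn("ngram_count", n_grams_udf(F.col("text")))
df.show()  # id=0 -> ngram_count=8, id=1 (null text) -> 0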