
I'm a newbie in PySpark, and I want to translate this NLP-based feature code, which is plain Python, into PySpark.

#Python
import pandas as pd

N = 2

# number of distinct character bigrams in a string (0 for NaN/missing input)
n_grams = lambda input_text: 0 if pd.isna(input_text) else len(
    {input_text[i:i + N] for i in range(len(input_text) - N + 1)}
)


#quick test
n_grams_example = 'zhang1997'  # bigrams: ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']
n_grams(n_grams_example)       # 8

I checked the NGram Python docs and tried the following, unsuccessfully:

#PySpark
from pyspark.ml.feature import NGram

ndf = spark.createDataFrame([
    (0, ["zhang1997"])], ["id", "words"])

ndf.show()

+---+-----------+
| id|      words|
+---+-----------+
|  0|[zhang1997]|
+---+-----------+

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")

ngramDataFrame = ngram.transform(ndf)
ngramDataFrame.select("ngrams").show(truncate=False)
+------+
|ngrams|
+------+
|[]    |
+------+

Am I missing something here? I get an empty [] as a result instead of ['zh', 'ha', 'an', 'ng', 'g1', '19', '99', '97']. I'm interested in the length of the n-gram set, which is 8 in this case.

Update: I found a way to do this without using NGram, but I'm not happy with its performance.

def n_grams(input_text):
    if input_text is None:
        return 0
    N = 2
    # number of distinct character bigrams
    return len({input_text[i:i + N] for i in range(len(input_text) - N + 1)})
Mario
  • You're creating bigrams over a list of length 1 (which should indeed be the empty list). You probably want that to be a list of characters – erip Sep 15 '21 at 15:33
  • I guess specifically what I'm saying is: you want character bigrams, but what you're computing is word bigrams. – erip Sep 15 '21 at 15:44
  • Thanks for your reply. So how can I get the list of characters for the bigram/n-gram and count the distinct results? – Mario Sep 15 '21 at 15:57
  • @erip I also tried to adapt this [answer](https://stackoverflow.com/a/39801829/10452700) unsuccessfully. – Mario Sep 15 '21 at 16:14
