What I am trying to do is compute N-grams using the code from this answer: Stack Overflow Answer for N-gram.
The data below is test data; the actual computation will run on large, distributed data.
+--------------+
| author|
+--------------+
|Test Data Five|
|Test Data Five|
|Data Test Five|
|Test data Five|
|Test Data Five|
| Jack|
+--------------+
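For reference, the test DataFrame above can be recreated like this (a minimal sketch; the SparkSession setup and app name are assumptions, since the original environment is not shown):

from pyspark.sql import SparkSession

# Assumed local session for testing; the real job runs on a cluster
spark = SparkSession.builder.appName("ngram-counts").getOrCreate()

author_df = spark.createDataFrame(
    [("Test Data Five",), ("Test Data Five",), ("Data Test Five",),
     ("Test data Five",), ("Test Data Five",), ("Jack",)],
    ["author"],
)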
from pyspark.ml.feature import NGram
from pyspark.ml import Pipeline
from pyspark.sql import functions as F

def build_ngrams(name, n=3):
    # One NGram stage per order: outputs 1_grams, 2_grams, ..., n_grams
    ngrams = [
        NGram(n=i, inputCol=name, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]
    return Pipeline(stages=ngrams)

# Tokenize each author string on whitespace, then collapse all rows into
# a single flat token list, so n-grams also span row boundaries
temp_kdf = author_df.withColumn("author", F.split("author", r"\s+"))
temp_kdf = temp_kdf.groupby().agg(F.collect_list("author").alias("author"))
data = temp_kdf.select(F.flatten(temp_kdf.author).alias("author"))
temp_kdf = build_ngrams("author").fit(data).transform(data)
The result I get (showing only the 2_grams column) is:
+--------------------+
| 2_grams|
+--------------------+
|[Test Data, Data Five, Five Test, Test Data, Data Five, Five Data, Data Test, Test Five, Five Test, Test data, data Five, Five Test, Test Data, Data Five, Five Jack]|
+--------------------+
The result I want is the top N rows of each n-gram column together with their frequency counts, like this:
+---------+--------+
|  2_grams|2_counts|
+---------+--------+
|Test Data|       3|
|Data Five|       3|
|Five Test|       3|
|Five Data|       1|
+---------+--------+
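A sketch of one possible route (assuming the build_ngrams pipeline above has already run; N = 4 and the loop bound are hypothetical choices): explode each *_grams array into one row per n-gram, then group and count.

N = 4  # hypothetical cutoff for the top rows

for i in range(1, 4):  # same orders as build_ngrams(n=3)
    gram_col = "{0}_grams".format(i)
    count_col = "{0}_counts".format(i)
    top_grams = (
        temp_kdf
        .select(F.explode(gram_col).alias(gram_col))  # one row per n-gram
        .groupBy(gram_col)
        .count()
        .withColumnRenamed("count", count_col)
        .orderBy(F.desc(count_col))
        .limit(N)
    )
    top_grams.show(truncate=False)

I am unsure whether this explode-and-count approach is the right way to scale to the large distributed data, or whether there is a more idiomatic way to get the counts out of the pipeline itself.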