What I am trying to do is compute N-grams using the code from this answer: Stack Overflow Answer for N-gram.
The data below is test data; the actual computation will run on large, distributed data.
+--------------+
| author|
+--------------+
|Test Data Five|
|Test Data Five|
|Data Test Five|
|Test data Five|
|Test Data Five|
| Jack|
+--------------+
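For reference, the test DataFrame above can be recreated like this (a minimal sketch; the SparkSession setup and app name are assumptions, since the original environment is not shown):

from pyspark.sql import SparkSession

# Assumed local session for testing; the real job runs on a cluster
spark = SparkSession.builder.appName("ngram-counts").getOrCreate()

author_df = spark.createDataFrame(
    [("Test Data Five",), ("Test Data Five",), ("Data Test Five",),
     ("Test data Five",), ("Test Data Five",), ("Jack",)],
    ["author"],
)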
from pyspark.ml.feature import NGram
from pyspark.ml import Pipeline
from pyspark.sql import functions as F

def build_ngrams(name, n=3):
    # One NGram stage per order: outputs 1_grams, 2_grams, ..., n_grams
    ngrams = [
        NGram(n=i, inputCol=name, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]
    return Pipeline(stages=ngrams)

# Tokenize each author string on whitespace, then collapse all rows into
# a single flat token list, so n-grams also span row boundaries
temp_kdf = author_df.withColumn("author", F.split("author", r"\s+"))
temp_kdf = temp_kdf.groupby().agg(F.collect_list("author").alias("author"))
data = temp_kdf.select(F.flatten(temp_kdf.author).alias("author"))
temp_kdf = build_ngrams("author").fit(data).transform(data)
The result I get (showing only the 2_grams column) is:
+--------------------+
| 2_grams|
+--------------------+
|[Test Data, Data Five, Five Test, Test Data, Data Five, Five Data, Data Test, Test Five, Five Test, Test data, data Five, Five Test, Test Data, Data Five, Five Jack]|
+--------------------+
The result I want is the top N rows of each n-gram column together with their frequency counts, like this:
+---------+--------+
|  2_grams|2_counts|
+---------+--------+
|Test Data|       3|
|Data Five|       3|
|Five Test|       3|
|Five Data|       1|
+---------+--------+
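A sketch of one possible route (assuming the build_ngrams pipeline above has already run; N = 4 and the loop bound are hypothetical choices): explode each *_grams array into one row per n-gram, then group and count.

N = 4  # hypothetical cutoff for the top rows

for i in range(1, 4):  # same orders as build_ngrams(n=3)
    gram_col = "{0}_grams".format(i)
    count_col = "{0}_counts".format(i)
    top_grams = (
        temp_kdf
        .select(F.explode(gram_col).alias(gram_col))  # one row per n-gram
        .groupBy(gram_col)
        .count()
        .withColumnRenamed("count", count_col)
        .orderBy(F.desc(count_col))
        .limit(N)
    )
    top_grams.show(truncate=False)

I am unsure whether this explode-and-count approach is the right way to scale to the large distributed data, or whether there is a more idiomatic way to get the counts out of the pipeline itself.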