I have a python dictionary:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}

I have created an RDD like this:

docNameToText = sc.parallelize(diction)

I need to find the top-2 words appearing in each document. So, the result should look something like this:

1.csv, test, is
2.txt, test, that

I am new to pyspark. I know the algorithm, but I am not sure how to do it in pyspark. I need to:

- convert the file-to-string => file-to-wordFreq
- arrange wordFreq in non-increasing order - if two words have the same freq, arrange them in alphabetical order
- display the top 2

How can I implement this?
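To show what I mean, here is the per-document logic in plain Python (top_words is just a helper name I made up for illustration):

from collections import Counter

def top_words(text, n=2):
    # Count word frequencies, then sort by descending count,
    # breaking ties alphabetically
    counts = Counter(text.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

top_words(diction['1.csv'])  # ['test', 'is']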


1 Answer

Just use Counter:

from collections import Counter

(sc
    .parallelize(diction.items())
    # Split by whitespace
    .mapValues(lambda s: s.split())
    # Count word occurrences
    .mapValues(Counter)
    # Take the two most common words
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)]))
  • thanks! Minor clarification: what if I also wanted to put results in alphabetical order if two words have the same count? For example, for '1.csv', result should be ['test', 'that'] and not ['that', 'test']. Thanks. – stfd1123581321 Apr 17 '17 at 12:58
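To break ties alphabetically, one option is to replace most_common with an explicit sort on (-count, word). A sketch along the same lines (collect is only there to materialize the result for inspection):

(sc
    .parallelize(diction.items())
    .mapValues(lambda s: s.split())
    .mapValues(Counter)
    # Sort by descending count, then alphabetically for ties,
    # and keep the top two words
    .mapValues(lambda c: [w for w, _ in
                          sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:2]])
    .collect())
# [('1.csv', ['test', 'is']), ('2.txt', ['test', 'that'])]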