I have a python dictionary:

diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}

I have created an RDD like this:

docNameToText = sc.parallelize(diction)

I need to find the top-2 words appearing in each document. So, the result should look something like this:

1.csv, test, is
2.txt, test, that

I am new to pyspark. I know the algorithm, but I am not sure how to do it in pyspark. I need to:

- convert the file-to-string => file-to-wordFreq
- arrange wordFreq in non-increasing order - if two words have the same freq, arrange them in alphabetical order
- display the top 2

How can I implement this?
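To show what I mean, here is the per-document logic in plain Python (top_words is just a helper name I made up for illustration):

from collections import Counter

def top_words(text, n=2):
    # Count word frequencies, then sort by descending count,
    # breaking ties alphabetically
    counts = Counter(text.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

top_words(diction['1.csv'])  # ['test', 'is']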


1 Answer

Just use Counter:

from collections import Counter

(sc
    .parallelize(diction.items())
    # Split by whitespace
    .mapValues(lambda s: s.split())
    # Count word occurrences
    .mapValues(Counter)
    # Take the two most common words
    .mapValues(lambda c: [x for (x, _) in c.most_common(2)]))
  • thanks! Minor clarification: what if I also wanted to put results in alphabetical order if two words have the same count? For example, for '1.csv', result should be ['test', 'that'] and not ['that', 'test']. Thanks. – stfd1123581321 Apr 17 '17 at 12:58
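To break ties alphabetically, one option is to replace most_common with an explicit sort on (-count, word). A sketch along the same lines (collect is only there to materialize the result for inspection):

(sc
    .parallelize(diction.items())
    .mapValues(lambda s: s.split())
    .mapValues(Counter)
    # Sort by descending count, then alphabetically for ties,
    # and keep the top two words
    .mapValues(lambda c: [w for w, _ in
                          sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:2]])
    .collect())
# [('1.csv', ['test', 'is']), ('2.txt', ['test', 'that'])]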