I have a python dictionary:
diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}
I have created an RDD like this:
docNameToText = sc.parallelize(diction)
I need to find the top-2 words appearing in each document. So, the result should look something like this:
1.csv, test, is
2.txt, test, that
I am new to pyspark. I know the algorithm, but not sure how to do it in pyspark. I need to:
- convert the file-to-string => file-to-wordFreq
- arrange wordFreq in non-increasing order; if two words have the same frequency, arrange them alphabetically
- display the top 2
How can I implement this?
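A minimal sketch of the steps above, assuming whitespace-delimited words. The per-document top-2 logic is plain Python, so it can be written as a helper function and applied to each value of the RDD; note that `sc.parallelize(diction)` distributes only the dict's keys, so the (name, text) pairs should be parallelized instead:

```python
from collections import Counter

def top_words(text, n=2):
    # Count word frequencies, then rank by descending count,
    # breaking ties alphabetically.
    counts = Counter(text.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

diction = {'1.csv': 'this is is a test test test ',
           '2.txt': 'that that was a test test test'}

# In PySpark (sketch, assuming an active SparkContext `sc`):
# docNameToText = sc.parallelize(diction.items())
# result = dict(docNameToText.mapValues(top_words).collect())

# Local equivalent, for illustration:
result = {name: top_words(text) for name, text in diction.items()}
# result['1.csv'] == ['test', 'is']   (test: 3, is: 2)
# result['2.txt'] == ['test', 'that'] (test: 3, that: 2)
```

The `(-count, word)` sort key gives the required ordering in a single `sorted` call: higher counts come first, and equal counts fall back to alphabetical order.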