I am just starting out with Apache Spark in Java and am working on a mini project with some book data. I need to find the most popular author in each country.

I have a pair RDD where the key is the country and the value is the author, like this:

[(usa,C. S Lewis), (australia,Jason Shinder), (usa,Bernie S.), (usa,Bernie S.)]

Do I have to use Tuple3 to add one more field that counts how many times each value occurs? If so, how do I use combineByKey with Tuple3?

Another idea was to collect all the keys from the pair RDD and, for each key, filter down to a second pair RDD of author names and the number of times each is mentioned, from which I could pick the most popular author. But this doesn't feel like an elegant solution, since I would have to loop through the array of keys. Any help is appreciated.

kaushik3993

1 Answer

This is literally YAW (Yet Another Wordcount):

rdd.mapToPair(s -> new Tuple2<>(s, 1)).reduceByKey((c1, c2) -> c1 + c2);
  • The word count problem starts from an RDD that you convert to a PairRDD with 1 as the value. Here, I already have two fields: the key is the country and the value is the author. The solution depends on both fields, since I have to find the most popular author for each country, so I cannot replace either one with 1 in order to use reduceByKey. – kaushik3993 Oct 31 '17 at 12:24
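
One possible way to reconcile the answer with the comment above: the word-count trick still applies if the whole (country, author) pair is used as the key, so neither field has to be replaced by 1. The sketch below first counts each pair, then re-keys by country and keeps the author with the largest count. The class name TopAuthorPerCountry, the helper topAuthors, and the local[*] master are illustrative assumptions, not part of the original post. Under this approach, Tuple3 and combineByKey should not be needed.

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TopAuthorPerCountry {

    // Hypothetical helper: maps each country to (author, count) for its most frequent author.
    static JavaPairRDD<String, Tuple2<String, Integer>> topAuthors(
            JavaPairRDD<String, String> countryAuthor) {

        // Step 1: word count, where the "word" is the whole (country, author) pair.
        JavaPairRDD<Tuple2<String, String>, Integer> ones =
                countryAuthor.mapToPair(pair -> new Tuple2<>(pair, 1));
        JavaPairRDD<Tuple2<String, String>, Integer> counts =
                ones.reduceByKey(Integer::sum);

        // Step 2: re-key by country, carrying (author, count) as the value.
        JavaPairRDD<String, Tuple2<String, Integer>> byCountry =
                counts.mapToPair(e ->
                        new Tuple2<>(e._1()._1(), new Tuple2<>(e._1()._2(), e._2())));

        // Step 3: per country, keep the entry with the higher count.
        // On a tie, which author survives depends on evaluation order.
        return byCountry.reduceByKey((a, b) -> a._2() >= b._2() ? a : b);
    }

    public static void main(String[] args) {
        // Assumption: run locally for illustration only.
        SparkConf conf = new SparkConf()
                .setAppName("TopAuthorPerCountry")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Sample data taken from the question.
            JavaPairRDD<String, String> countryAuthor = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("usa", "C. S Lewis"),
                    new Tuple2<>("australia", "Jason Shinder"),
                    new Tuple2<>("usa", "Bernie S."),
                    new Tuple2<>("usa", "Bernie S.")));

            Map<String, Tuple2<String, Integer>> result =
                    topAuthors(countryAuthor).collectAsMap();

            result.forEach((country, top) ->
                    System.out.println(country + " -> " + top._1() + " (" + top._2() + ")"));
        }
    }
}
```

For the sample data in the question, this should map usa to Bernie S. with a count of 2 and australia to Jason Shinder with a count of 1. Both reduceByKey steps combine values on each partition before shuffling, so no loop over the keys on the driver is involved.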