What is the alternative for combineByKey while using Tuple3 in Apache Spark in Java?

Question

I am just starting out with Apache Spark in Java. I am currently doing a mini project with some book data, and I have to find the most popular author in each country.

I have a pairRDD where the key is the country and the value is the author, like this:

[(usa,C. S Lewis), (australia,Jason Shinder), (usa,Bernie S.), (usa,Bernie S.)]

Do I have to use Tuple3 to add one more field and count the number of times each value is present? If so, how do I use combineByKey for Tuple3?

Another idea was to collect all the keys from the pairRDD and, for each key, filter into another pairRDD of author names and the number of times each is mentioned, from which I could find the most popular author. But this doesn't feel like an elegant solution, as I would have to loop through the array of keys. Help.

score 1 · Answer 1 · answered Oct 31 '17 at 11:48


This is literally YAW (Yet Another Wordcount), with each (country, author) pair playing the role of the word:

rdd.mapToPair(s -> new Tuple2<>(s, 1)).reduceByKey((c1, c2) -> c1 + c2);
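The one-liner above counts whatever elements the RDD holds. If those elements are the (country, author) pairs themselves, neither field is replaced by 1: the whole pair becomes the key, and a second reduce keyed by country can then pick the author with the highest count, so no Tuple3 or combineByKey is needed. The following is a minimal sketch of that two-step data flow using plain Java collections instead of Spark, so it runs without a cluster. The class and method names are hypothetical, and the Spark calls in the comments are an assumption about how each step would map onto a JavaPairRDD<String, String>.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PopularAuthorSketch {

    // Hypothetical helper: returns country -> (most frequent author, count).
    static Map<String, Map.Entry<String, Integer>> topAuthorPerCountry(
            List<Map.Entry<String, String>> pairs) {

        // Step 1: count each whole (country, author) pair.
        // Assumed Spark equivalent:
        //   pairs.mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey(Integer::sum)
        Map<Map.Entry<String, String>, Integer> pairCounts = new HashMap<>();
        for (Map.Entry<String, String> p : pairs) {
            pairCounts.merge(p, 1, Integer::sum);
        }

        // Step 2: re-key by country and keep the author with the highest count.
        // Assumed Spark equivalent:
        //   .mapToPair(t -> new Tuple2<>(t._1()._1(), new Tuple2<>(t._1()._2(), t._2())))
        //   .reduceByKey((a, b) -> a._2() >= b._2() ? a : b)
        Map<String, Map.Entry<String, Integer>> best = new HashMap<>();
        for (Map.Entry<Map.Entry<String, String>, Integer> e : pairCounts.entrySet()) {
            String country = e.getKey().getKey();
            Map.Entry<String, Integer> candidate =
                    new SimpleEntry<String, Integer>(e.getKey().getValue(), e.getValue());
            best.merge(country, candidate,
                    (a, b) -> a.getValue() >= b.getValue() ? a : b);
        }
        return best;
    }

    public static void main(String[] args) {
        // Sample data from the question.
        List<Map.Entry<String, String>> pairs = Arrays.asList(
                new SimpleEntry<String, String>("usa", "C. S Lewis"),
                new SimpleEntry<String, String>("australia", "Jason Shinder"),
                new SimpleEntry<String, String>("usa", "Bernie S."),
                new SimpleEntry<String, String>("usa", "Bernie S."));
        System.out.println(topAuthorPerCountry(pairs));
    }
}
```

For the sample data in the question, usa maps to Bernie S. with a count of 2 and australia maps to Jason Shinder with a count of 1. When two authors tie within a country, which one survives depends on the order in which they are merged.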


user8862144


The word count problem starts with an RDD that you then convert to a PairRDD with 1 as the value. Here, I already have two fields: the key is the country and the value is the author. The solution depends on both fields, since I have to find the most popular author for each country, so I cannot replace either one with 1 in order to use reduceByKey. – kaushik3993 Oct 31 '17 at 12:24
