I'm a beginner with Spark and I am trying to create an RDD that contains the top 3 values for every key, (Not just the top 3 values). My current RDD contains thousands of entries in the following format:
(key, String, value)
So imagine I had an RDD with content like this:
[("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ccc", 2), ("K1", "ddd", 9),
("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)]
I can currently display the top 3 values in the RDD like so:
("K1", "ddd", 9)
("B1", "iop", 8)
("B1", "rty", 7)
Using:
top3RDD = rdd.takeOrdered(3, key = lambda x: x[2])
Instead what I want is to gather the top 3 values for every key in the RDD so I would like to return this instead:
("K1", "ddd", 9)
("K1", "aaa", 6)
("K1", "bbb", 3)
("B1", "iop", 8)
("B1", "rty", 7)
("B1", "qwe", 4)