How to group top item from a JavaRDD without using Spark SQL?

Question

I have a JavaRDD that contains userId, movieId, and rating values, like this:

Rating [userId=1, movieId=2858, rating=4.0], Rating [userId=3, movieId=2858, rating=5.0], Rating [userId=12, movieId=2658, rating=5.0].

I want to find the top 5 movies by number of views. I searched but could not work out how to group by movieId in a JavaRDD. I want to count how many users watched each movie and store the result in a Map as Map(movieId, num_of_user). I am new to Apache Spark.

Desired Output:

2858 - 2

2658 - 1

I would appreciate any example, link, or tutorial that shows a similar operation on a JavaRDD.

Update: I found a similar Scala-based question. Could somebody have a look and convert the Scala code to Java?

Thanks in advance.

score 0 · Answer 1 · answered Oct 29 '17 at 05:02

Adding an example; you can build your logic using the following template:

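// Assumes a SparkSession named `spark` is in scope (as in spark-shell); toDF() needs its implicits.
import spark.implicits._
import org.apache.spark.sql.functions._

case class Rating(userId: Long, movieId: Long, rating: Long)

val RatingDF = List(
  Rating(1, 2858, 4),
  Rating(3, 2858, 5),
  Rating(12, 2658, 5)
).toDF()

RatingDF.show()

// DataFrame API rather than a SQL query string:
// count ratings per movie, most-rated first (append .limit(5) to keep only the top five)
val topMovieIDs =
  RatingDF.groupBy("movieId").count().orderBy(desc("count")).cache()
topMovieIDs.show()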

Results:

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|   2858|     4|
|     3|   2858|     5|
|    12|   2658|     5|
+------+-------+------+

+-------+-----+
|movieId|count|
+-------+-----+
|   2858|    2|
|   2658|    1|
+-------+-----+
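
Since the question asks for Java, here is a rough sketch of the same count-then-rank logic in plain Java. On a JavaRDD the usual route would be mapToPair to (movieId, 1) pairs followed by reduceByKey, then a sort by count and take(5). The version below runs the same steps on a local java.util.List with streams, so it works without a cluster. The Rating class and the topMovies helper are hypothetical names, not taken from the question's code. It also assumes each user rates a movie at most once, so counting ratings equals counting users.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Rating class in the question.
class Rating {
    final int userId;
    final int movieId;
    final double rating;

    Rating(int userId, int movieId, double rating) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
    }
}

class MovieCounts {
    // Returns the n most-rated movies as movieId -> count, highest count first.
    static Map<Integer, Long> topMovies(List<Rating> ratings, int n) {
        // Step 1: count ratings per movieId (the role reduceByKey plays on a pair RDD).
        Map<Integer, Long> counts = ratings.stream()
                .collect(Collectors.groupingBy(r -> r.movieId, Collectors.counting()));

        // Step 2: sort by count descending, keep n entries,
        // and collect into a LinkedHashMap so the ranking order is preserved.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<Rating> ratings = Arrays.asList(
                new Rating(1, 2858, 4.0),
                new Rating(3, 2858, 5.0),
                new Rating(12, 2658, 5.0));

        // Prints "2858 - 2" then "2658 - 1" for the sample data.
        topMovies(ratings, 5).forEach((movieId, count) ->
                System.out.println(movieId + " - " + count));
    }
}
```

If a user can rate the same movie more than once, you would need to de-duplicate (userId, movieId) pairs before counting.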
