I have a JavaRDD that contains userId, movieId, and rating values, like this:

Rating [userId=1, movieId=2858, rating=4.0], Rating [userId=3, movieId=2858, rating=5.0], Rating [userId=12, movieId=2658, rating=5.0].

I want to find the top 5 movies based on the number of views. I tried searching but could not work out how to group by movieId and userId in a JavaRDD. I want to count how many users watched each movie and store the result in a Map as Map(movieId, num_of_user). I am new to Apache Spark.

Desired Output:

2858 - 2

2658 - 1

I would appreciate any example, link, or tutorial that performs a similar operation on a JavaRDD.

Update: I found a similar Scala-based question. Could somebody have a look and convert the Scala code to Java?

Thanks in advance.

Om Prakash

1 Answer

Here is an example; you can build your logic using the following template.

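// In spark-shell, spark.implicits._ is already in scope; elsewhere, import it
// from your SparkSession (assumed here to be named `spark`) so toDF() is available.
import spark.implicits._
import org.apache.spark.sql.functions._

case class Rating(userId: Long, movieId: Long, rating: Long)

val RatingDF = List(
  Rating(1, 2858, 4),
  Rating(3, 2858, 5),
  Rating(12, 2658, 5)
).toDF()

RatingDF.show()

// Using the DataFrame API rather than SQL: count ratings per movie,
// sort by count in descending order, and keep the top 5.
val topMovieIDs =
  RatingDF.groupBy("movieId").count().orderBy(desc("count")).limit(5).cache()
topMovieIDs.show()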

Results:

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|   2858|     4|
|     3|   2858|     5|
|    12|   2658|     5|
+------+-------+------+

+-------+-----+
|movieId|count|
+-------+-----+
|   2858|    2|
|   2658|    1|
+-------+-----+
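
Since the question asks about an RDD rather than a DataFrame, the same counting can also be sketched with the RDD API. This is a minimal sketch, not a definitive implementation: the object name TopMoviesRdd, the helper names countViews and topMovies, and the local master setting are assumptions for illustration. It counts one view per rating, which equals the number of users only if each user rates a given movie at most once.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object TopMoviesRdd {
  // Same fields as the Rating records shown in the question.
  case class Rating(userId: Long, movieId: Long, rating: Double)

  // Hypothetical helper: emit (movieId, 1) per rating and sum per movie.
  def countViews(ratings: RDD[Rating]): RDD[(Long, Long)] =
    ratings.map(r => (r.movieId, 1L)).reduceByKey(_ + _)

  // Hypothetical helper: the n movies with the highest counts, descending.
  def topMovies(ratings: RDD[Rating], n: Int): Array[(Long, Long)] =
    countViews(ratings).top(n)(Ordering.by[(Long, Long), Long](_._2))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("TopMoviesRdd")
      .master("local[*]")
      .getOrCreate()

    val ratings = spark.sparkContext.parallelize(Seq(
      Rating(1, 2858, 4.0),
      Rating(3, 2858, 5.0),
      Rating(12, 2658, 5.0)
    ))

    // Map(movieId -> count); collect only when the number of movies is small.
    val countsByMovie: scala.collection.Map[Long, Long] =
      countViews(ratings).collectAsMap()
    println(countsByMovie)

    // Print the top 5 in the "movieId - count" form from the question.
    topMovies(ratings, 5).foreach { case (movieId, n) => println(s"$movieId - $n") }

    spark.stop()
  }
}
```

The Java API exposes the same steps on JavaRDD and JavaPairRDD (mapToPair, reduceByKey, collectAsMap), so this should translate fairly directly. If a user can rate the same movie more than once, de-duplicating the (movieId, userId) pairs with distinct() before counting would give the number of users rather than the number of ratings.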
vaquar khan