I have a dataset of a social network containing information about who follows whom. I need to find the most active user (for example, the user who follows the most accounts). The lines of my dataset look like this:

 1000066:262792,273106,590979,1152305,1691577,1888250

and some of them look like this:

1000073:private
1000069:notfound

Question 1: How can I build an RDD from these lines so that the key of each pair is the first number (the part before ':') and the values are the comma-separated numbers that follow?

Question 2: How could I solve this problem using GraphX?

All I need is to find the most active user in this dataset. Thanks in advance; an answer to either of these would help.

AliSafari186

1 Answer

Q1. You could create an RDD of (user, followers, count) tuples, where followers holds the IDs listed after the ':'.

In a map function, pass each line of the RDD to:

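// Parse one line of the form "user:id1,id2,..." into (user, followers, count).
def createTuple(s: String): (String, Array[String], Int) = {
  val kv = s.trim.split(":", 2)
  val user = kv(0)
  // "private" and "notfound" lines carry no IDs, so keep only numeric entries.
  val followers =
    if (kv.length < 2) Array.empty[String]
    else kv(1).split(",").filter(id => id.nonEmpty && id.forall(_.isDigit))
  val count = followers.length

  (user, followers, count)
}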
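
To get from per-line counts to the most active user, one option is to take the maximum over the counts. The following is only a minimal sketch using plain Scala collections and the sample lines from the question. The helper name `followingCount` is hypothetical, and treating "private" and "notfound" lines as zero is an assumption, not something the dataset description confirms.

```scala
// Hypothetical helper: parse one line into (user, number of IDs after the ':').
// Assumption: non-numeric entries such as "private" or "notfound" count as zero.
def followingCount(line: String): (String, Int) = {
  val parts = line.trim.split(":", 2)
  val user = parts(0)
  val ids =
    if (parts.length < 2) Array.empty[String]
    else parts(1).split(",").map(_.trim).filter(id => id.nonEmpty && id.forall(_.isDigit))
  (user, ids.length)
}

// Sample lines taken from the question.
val lines = Seq(
  " 1000066:262792,273106,590979,1152305,1691577,1888250",
  "1000073:private",
  "1000069:notfound"
)

// Count per user, then pick the pair with the largest count.
val counts = lines.map(followingCount)
val mostActive = counts.maxBy(_._2)
println(mostActive)
```

With Spark, the same function could presumably be passed to `map` on the RDD of lines, with the maximum taken by `reduce`, for example `sc.textFile(path).map(followingCount).reduce((a, b) => if (a._2 >= b._2) a else b)`, where `sc` and `path` are placeholders for your own SparkContext and input file.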
bp2010
  • Here followers is an array of strings, so the RDD would be an RDD[(String, Array[String])]. That could be useful, but how do I find the count of each user's followers with this RDD? – AliSafari186 Jun 13 '18 at 14:45
  • Edited the answer to add the count. It could be done in many other ways. – bp2010 Jun 14 '18 at 07:46