A different take on @Glennie's solution (who, IMO, is right that your initial approach is a poor fit).
TL;DR:

import scala.collection.mutable

players.map { case (player, team) => (team, mutable.HashSet[String](player)) }
  .reduceByKey(_ ++= _)
  .flatMap {
    case (team, players) =>
      for (player <- players)
        yield (player, players - player)
  }
The basic idea is the same (build a list of teammates keyed by team, and flatMap this result), but I suggest using other building blocks for the same result. Whether or not this is a win depends on taste, and on the performance characteristics of your dataset.
Different take on the reduceByKey
Reducing by key, here, involves concatenating a collection (of players) with one or more additional players.
If we take the original code:

players.map { case (player, club) => (club, List(player)) }
  .reduceByKey(_ ++ _)
Internally, we will end up calling something like this (as of Spark 1.4):
def add: (List[String], List[String]) => List[String] = _ ++ _

players.map { case (player, team) => (team, List(player)) }
  .combineByKey(
    // invoked the first time we see a new team on each partition
    (list: List[String]) => list,
    // invoked each time we merge a player into its team's list
    // (i.e. map-side combine)
    add,
    // invoked each time we merge two partial lists for the same team
    // (i.e. reduce-side combine)
    add)
The takeaway here is that the add operation (_++_) is called a lot of times, so it had better be optimized.
In this respect, we know that List performs poorly, because each concatenation entails copying the whole existing list into a new one. Please note: "poorly" may actually be irrelevant. If you have millions of teams and only 20 players per team, then the ++ performance might be dwarfed by the other Spark computations involved in the reducing.
(Off the top of my head, there's worse: if the List becomes really big, then, seeing as some of the operations involved in its serialization are implemented recursively, we might hit a stack overflow. I'd have to check on that.)
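To make the cost concrete, here is a rough, self-contained sketch (plain Scala, no Spark, and nothing like a rigorous benchmark) that mimics what the reduce does: repeatedly merging single-player collections into a growing accumulator.

import scala.collection.mutable

// Crude timing helper, for illustration only.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val n = 10000

// Immutable List: every ++ copies the whole accumulator, O(n^2) overall.
time("List ++") {
  (1 to n).foldLeft(List.empty[String])((acc, i) => acc ++ List(i.toString))
}

// Mutable ArrayBuffer: ++= appends in place, amortized O(1) per element.
time("ArrayBuffer ++=") {
  (1 to n).foldLeft(mutable.ArrayBuffer.empty[String])((acc, i) => acc ++= Seq(i.toString))
}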
So we might benefit from switching to a mutable collection, like so:

players.map { case (player, team) => (team, mutable.ArrayBuffer[String](player)) }
  .reduceByKey(_ ++= _)
- We now have a mutable collection, for which concatenation is optimized
- We use ++= instead of ++, so that we do not even have to allocate a brand new collection when merging two existing ones
- If we know our dataset well, we may be able to pre-size the buffer so as to have predictable memory allocation, and avoid buffer resizing as much as possible (or switch implementations accordingly); see the sketch below
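What pre-sizing could look like, as a hedged sketch using combineByKey directly (expectedTeamSize is a made-up figure you would derive from knowledge of your own data):

import scala.collection.mutable

val expectedTeamSize = 20 // assumption: roughly 20 players per team

players.map { case (player, team) => (team, player) }
  .combineByKey(
    // allocate each team's buffer once, at its expected final capacity
    (player: String) => new mutable.ArrayBuffer[String](expectedTeamSize) += player,
    // map-side combine: append a player in place
    (buf: mutable.ArrayBuffer[String], player: String) => buf += player,
    // reduce-side combine: splice one partial buffer into the other
    (a: mutable.ArrayBuffer[String], b: mutable.ArrayBuffer[String]) => a ++= b)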
Different take on the flatMap
Gains of using a mutable collection
The original implementation uses, once more, extensive list operations like take and drop, combined with a zip with index.
The use of a mutable collection serves us better in terms of readability here, as we can replace three immutable list copies (take, drop, ++):

list.take(index) ++ list.drop(index + 1)

with only one (- performs a clone):

list - list(index)
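Side by side, and assuming teams names the (team, players) RDD produced by the reduceByKey step above (that name is mine, for illustration), the two flavors would look something like:

// immutable flavor: three copies per player (take, drop, ++)
teams.flatMap { case (team, players) =>
  players.zipWithIndex.map { case (player, index) =>
    (player, players.take(index) ++ players.drop(index + 1))
  }
}

// mutable flavor: a single copy per player (- clones, then removes)
teams.flatMap { case (team, players) =>
  players.map(player => (player, players - player))
}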
Alternative syntax
We can also provide a totally different implementation, avoiding the zipping with index altogether and leveraging for-comprehensions:
.flatMap {
  case (team, players) =>
    for (player <- players)
      yield (player, players - player)
}
Note that the players - player step involves a lookup of the player in the collection. Using an ArrayBuffer, this is an O(n) operation. So we may, once again, depending on the dataset, prefer using a mutable.HashSet instead of an array buffer if we go down this path.
(I was going to add "provided there are no duplicate player names", but this does not matter: if you have two "John"s in a team, there is no use having two lines in your RDD for the two Johns; they hold no more meaning than a single one.)
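For completeness, here is a minimal, self-contained version of the TL;DR pipeline that can be run locally; the team and player names are made up for the sake of illustration:

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object TeammatesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("teammates").setMaster("local[*]"))

    // (player, team) pairs, as in the question
    val players = sc.parallelize(Seq(
      ("Alice", "Red"), ("Bob", "Red"), ("Carol", "Red"),
      ("Dave", "Blue"), ("Eve", "Blue")))

    val teammates = players
      .map { case (player, team) => (team, mutable.HashSet[String](player)) }
      .reduceByKey(_ ++= _)
      .flatMap {
        case (team, ps) =>
          for (player <- ps) yield (player, ps - player)
      }

    teammates.collect().foreach(println)
    // e.g. (Alice,Set(Bob, Carol)), (Dave,Set(Eve)), ...

    sc.stop()
  }
}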