A different take on @Glennie's solution (who, IMO, is right that your initial approach is a poor fit).
TL;DR:

import scala.collection.mutable

players.map { case (player, team) => (team, mutable.HashSet[String](player)) }
  .reduceByKey(_ ++= _)
  .flatMap {
    case (team, players) =>
      for (player <- players)
        yield (player, players - player)
  }
The basic idea is the same (build a list of teammates keyed by team, and flatMap this result), but I suggest using other building blocks for the same result. Whether or not this is a win depends on taste, and on the performance characteristics of your dataset.
Different take on the reduceByKey
Reducing by key, here, involves concatenating a collection (of players) with one or more additional players.
If we take the original code:

players.map { case (player, club) => (club, List(player)) }
  .reduceByKey(_ ++ _)
Internally, we will end up calling something like this (as of Spark 1.4):
def add: (List[String], List[String]) => List[String] = _ ++ _

players.map { case (player, team) => (team, List(player)) }
  .combineByKey(
    // invoked the first time we see a new team on each partition
    (list: List[String]) => list,
    // invoked each time we merge a player into its team's list
    // (i.e. map-side combine)
    add,
    // invoked each time we merge two partial lists for the same team
    // (i.e. reduce-side combine)
    add)
The takeaway here is that the add operation (_++_) is called a lot of times, so it had better be optimized.
In this respect, we know that List performs poorly, because each concatenation entails copying the whole existing list into a new one. Please note: "poorly" may actually be irrelevant. If you have millions of teams and only 20 players per team, then the ++ performance might be dwarfed by the other Spark computations involved in the reducing.
(Off the top of my head, there's worse: if the List becomes really big, then, seeing as some of the operations involved in its serialization are implemented recursively, we might hit a stack overflow. I'd have to check on that.)
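To make the cost concrete, here is a rough, self-contained sketch (plain Scala, no Spark, and nothing like a rigorous benchmark) that mimics what the reduce does: repeatedly merging single-player collections into a growing accumulator.

import scala.collection.mutable

// Crude timing helper, for illustration only.
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val n = 10000

// Immutable List: every ++ copies the whole accumulator, O(n^2) overall.
time("List ++") {
  (1 to n).foldLeft(List.empty[String])((acc, i) => acc ++ List(i.toString))
}

// Mutable ArrayBuffer: ++= appends in place, amortized O(1) per element.
time("ArrayBuffer ++=") {
  (1 to n).foldLeft(mutable.ArrayBuffer.empty[String])((acc, i) => acc ++= Seq(i.toString))
}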
So we might benefit from switching to a mutable collection, like so:

players.map { case (player, team) => (team, mutable.ArrayBuffer[String](player)) }
  .reduceByKey(_ ++= _)
- We now have a mutable collection, for which concatenation is optimized
- We use ++= instead of ++, so that we do not even have to allocate a brand new collection when merging two existing ones
- If we know our dataset well, we may be able to pre-size the buffer so as to have predictable memory allocation, and avoid buffer resizing as much as possible (or switch implementations accordingly); see the sketch below
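What pre-sizing could look like, as a hedged sketch using combineByKey directly (expectedTeamSize is a made-up figure you would derive from knowledge of your own data):

import scala.collection.mutable

val expectedTeamSize = 20 // assumption: roughly 20 players per team

players.map { case (player, team) => (team, player) }
  .combineByKey(
    // allocate each team's buffer once, at its expected final capacity
    (player: String) => new mutable.ArrayBuffer[String](expectedTeamSize) += player,
    // map-side combine: append a player in place
    (buf: mutable.ArrayBuffer[String], player: String) => buf += player,
    // reduce-side combine: splice one partial buffer into the other
    (a: mutable.ArrayBuffer[String], b: mutable.ArrayBuffer[String]) => a ++= b)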
Different take on the flatMap
Gains of using a mutable collection
The original implementation uses, once more, extensive list operations like take and drop, combined with a zip with index.
The use of a mutable collection serves us better in terms of readability here, as we can replace three immutable list copies (take, drop, ++):

list.take(index) ++ list.drop(index + 1)

with only one (- performs a clone):

list - list(index)
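Side by side, and assuming teams names the (team, players) RDD produced by the reduceByKey step above (that name is mine, for illustration), the two flavors would look something like:

// immutable flavor: three copies per player (take, drop, ++)
teams.flatMap { case (team, players) =>
  players.zipWithIndex.map { case (player, index) =>
    (player, players.take(index) ++ players.drop(index + 1))
  }
}

// mutable flavor: a single copy per player (- clones, then removes)
teams.flatMap { case (team, players) =>
  players.map(player => (player, players - player))
}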
Alternative syntax
We can also provide a totally different implementation, avoiding the zipping with index altogether and leveraging for-comprehensions:
.flatMap {
  case (team, players) =>
    for (player <- players)
      yield (player, players - player)
}
Note that the players - player step involves a lookup of the player in the collection. Using an ArrayBuffer, this is an O(n) operation. So we may, once again, depending on the dataset, prefer using a mutable.HashSet instead of an array buffer if we go down this path.
(I was going to add "provided there are no duplicate player names", but this does not matter: if you have two "John"s in a team, there is no use having two lines in your RDD for the two Johns; they hold no more meaning than a single one.)
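For completeness, here is a minimal, self-contained version of the TL;DR pipeline that can be run locally; the team and player names are made up for the sake of illustration:

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object TeammatesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("teammates").setMaster("local[*]"))

    // (player, team) pairs, as in the question
    val players = sc.parallelize(Seq(
      ("Alice", "Red"), ("Bob", "Red"), ("Carol", "Red"),
      ("Dave", "Blue"), ("Eve", "Blue")))

    val teammates = players
      .map { case (player, team) => (team, mutable.HashSet[String](player)) }
      .reduceByKey(_ ++= _)
      .flatMap {
        case (team, ps) =>
          for (player <- ps) yield (player, ps - player)
      }

    teammates.collect().foreach(println)
    // e.g. (Alice,Set(Bob, Carol)), (Dave,Set(Eve)), ...

    sc.stop()
  }
}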