
Suppose you have

val docs = List(List("one", "two"), List("two", "three"))

where e.g. List("one", "two") represents a document containing terms "one" and "two", and you want to build a map with the document frequency for every term, i.e. in this case

Map("one" -> 1, "two" -> 2, "three" -> 1)

How would you do that in Scala? (And in an efficient way, assuming a much larger dataset.)

My first Java-like thought is to use a mutable map:

import scala.collection.mutable

val freqs = mutable.Map.empty[String, Int]
for (doc <- docs)
  for (term <- doc)
    freqs(term) = freqs.getOrElse(term, 0) + 1  // unseen terms start from 0

This works well enough, but I'm wondering how to do it in a more "functional" way, without resorting to a mutable map.
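For reference, here is a self-contained sketch of the mutable approach wrapped in a function. The names `MutableDocFreq` and `docFreqs` are made up for illustration, and the `distinct` call is an assumption on my part: it makes a term that is repeated inside one document count only once, which is what "document frequency" usually means. Drop it if every document is already a set of unique terms.

```scala
import scala.collection.mutable

object MutableDocFreq {
  // For each term, counts the number of documents that contain it.
  def docFreqs(docs: List[List[String]]): Map[String, Int] = {
    val freqs = mutable.Map.empty[String, Int]
    // distinct (an assumption): a term repeated within one document counts once
    for (doc <- docs; term <- doc.distinct)
      freqs(term) = freqs.getOrElse(term, 0) + 1
    freqs.toMap // return an immutable snapshot; the mutation stays local
  }

  def main(args: Array[String]): Unit = {
    val docs = List(List("one", "two"), List("two", "three"))
    println(docFreqs(docs))
  }
}
```

The mutable map never escapes the function, so callers still see a pure function from documents to an immutable Map.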

Mirko N.

3 Answers


Try this:

scala> docs.flatten.groupBy(identity).mapValues(_.size)
res0: Map[String,Int] = Map(one -> 1, two -> 2, three -> 1)

If you are going to access the counts many times, you should avoid mapValues: it is "lazy" (it returns a view over the grouped map), so it would recompute the size on every access. This version gives you the same result without the recomputation:

docs.flatten.groupBy(identity).map { case (term, occurrences) => (term, occurrences.size) }

The identity function just means x => x.
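On Scala 2.13 and later, calling mapValues directly on a Map is deprecated in favor of `.view.mapValues`, which makes the laziness explicit. A minimal sketch of the lazy and the strict variant follows; the object and value names are only illustrative.

```scala
object StrictCounts {
  val docs = List(List("one", "two"), List("two", "three"))

  // Lazy: a MapView; the size of a group is recomputed on each lookup.
  val lazyCounts = docs.flatten.groupBy(identity).view.mapValues(_.size)

  // Strict: toMap materializes the counts once into an ordinary Map.
  val strictCounts: Map[String, Int] = lazyCounts.toMap

  def main(args: Array[String]): Unit =
    println(strictCounts("two")) // prints 2
}
```

If the counts are read only once, the view is fine; calling toMap pays for itself when the map is queried repeatedly.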

dhg
  • Nice. It does seem slower than with a mutable map though, with just ~10k terms. Cost of transforming collections 3 times? – Mirko N. Aug 28 '12 at 20:00
  • Yeah, it's nice and functional, but copying all that data around isn't helping the efficiency. The mutable Map version doesn't waste a lot of time. – dhg Aug 28 '12 at 20:06
  • +1 for teaching me that `mapValues` recomputes the map at each traversal. But in that case an expression with `foldLeft` should perform better than `groupBy`. – paradigmatic Aug 28 '12 at 21:36
  • Perfect answer. According to my benchmarking for my use case this vs a manual mutable map implementation is actually up to 4x faster. Speed is important for my use as this sits inside a mapreduce job processing TB of data. Furthermore as the Scala compiler and JVM get better over time this will auto optimize, whereas manual implementations will not. – samthebest Nov 11 '13 at 11:45
  • @samthebest I wonder why people concerned about performance systematically choose this solution instead of a foldLeft over a map with a default value, which was intentionally designed for this problem. Performance figures undoubtfully demonstrate that maps outperform the groupBy http://stackoverflow.com/a/24723417/1083704. – Val Jul 13 '14 at 14:18
  • @Val "undoubtfully"? Well I just checked `l.groupBy(identity).mapValues(_.size)` against `l.foldLeft(Map.empty[Int, Int].withDefaultValue(0))((m, x) => m + (x -> (1 + m(x))))` where `l` is `(1 to 10000).map(_ => scala.util.Random.nextInt(100)).toList`. With 5000 trials the `groupBy` approach took 2510 ms, whereas the `foldLeft` approach took 8349 ms. I've repeated this experiment with many other distributions, and different machines. Anyway, if you actually look at the implementation of `groupBy` you'll see why :) – samthebest Jul 13 '14 at 18:37
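The comparison described in the last comment can be reproduced roughly with a sketch like the one below. This is not the commenter's actual harness: `CountBench`, `timeMs`, `viaGroupBy` and `viaFoldLeft` are hypothetical names, the trial count is arbitrary, and the groupBy variant uses a strict `map` instead of `mapValues` so that the counts are really computed. It is a crude wall-clock loop with no JIT warm-up, so treat the numbers as indicative only; a harness such as JMH would be the more trustworthy choice.

```scala
object CountBench {
  // Crude timer: evaluates `body` `trials` times, returns elapsed milliseconds.
  def timeMs(trials: Int)(body: => Any): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < trials) { body; i += 1 }
    (System.nanoTime() - start) / 1000000L
  }

  // groupBy, then a strict map so the sizes are computed eagerly.
  def viaGroupBy(l: List[Int]): Map[Int, Int] =
    l.groupBy(identity).map { case (k, v) => k -> v.size }

  // foldLeft over an immutable map with a default value of 0.
  def viaFoldLeft(l: List[Int]): Map[Int, Int] =
    l.foldLeft(Map.empty[Int, Int].withDefaultValue(0))((m, x) => m + (x -> (1 + m(x))))

  def main(args: Array[String]): Unit = {
    val l = (1 to 10000).map(_ => scala.util.Random.nextInt(100)).toList
    println(s"groupBy:  ${timeMs(500)(viaGroupBy(l))} ms")
    println(s"foldLeft: ${timeMs(500)(viaFoldLeft(l))} ms")
  }
}
```

Results will vary with the Scala version, the JVM, and the distribution of keys, which may explain why commenters in this thread report opposite outcomes.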
13
docs.flatten.foldLeft(new Map.WithDefault(Map[String,Int](),Function.const(0))){
  (m,x) => m + (x -> (1 + m(x)))}

What a train wreck!

[Edit]

Ah, that's better!

docs.flatten.foldLeft(Map[String,Int]() withDefaultValue 0){
  (m,x) => m + (x -> (1 + m(x)))}
Landei
  • You can shorten the map initialization: `docs.flatten.foldLeft( Map[String,Int]() withDefaultValue 0 ){ (m,x) => ... }` – paradigmatic Aug 28 '12 at 21:33
  • It does seem faster than `groupBy`, so marking this as accepted. But both answers are interesting. – Mirko N. Aug 29 '12 at 15:41

Starting with Scala 2.13, after flattening the list of lists, we can use groupMapReduce, which is a one-pass alternative to groupBy/mapValues:

// val docs = List(List("one", "two"), List("two", "three"))
docs.flatten.groupMapReduce(identity)(_ => 1)(_ + _)
// Map[String,Int] = Map("one" -> 1, "three" -> 1, "two" -> 2)

This:

  • flattens the List of Lists into a single List

  • groups the list elements by themselves (identity) (the group part of groupMapReduce)

  • maps each grouped occurrence to 1 (_ => 1) (the map part of groupMapReduce)

  • reduces the values within each group by summing them (_ + _) (the reduce part of groupMapReduce).
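To make the steps above concrete, here is a small sketch that spells them out with groupBy, map and reduce, and checks that the result matches the groupMapReduce call. The object and value names are only illustrative, and the step-by-step version builds intermediate lists that groupMapReduce avoids.

```scala
object GroupMapReduceSteps {
  val docs = List(List("one", "two"), List("two", "three"))

  // One pass over the flattened list (Scala 2.13+).
  val onePass: Map[String, Int] =
    docs.flatten.groupMapReduce(identity)(_ => 1)(_ + _)

  // The same three steps written out explicitly.
  val stepByStep: Map[String, Int] =
    docs.flatten
      .groupBy(identity) // group: term -> all of its occurrences
      .map { case (term, occurrences) =>
        // map every occurrence to 1, then reduce by summing
        term -> occurrences.map(_ => 1).reduce(_ + _)
      }

  def main(args: Array[String]): Unit =
    println(onePass == stepByStep) // prints true
}
```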

Xavier Guihot