
I am trying to implement a simple word count in Scala using an immutable map (this is intentional). The way I am trying to accomplish it is as follows:

  1. Create an empty immutable map
  2. Create a scanner that reads through the file.
  3. While the scanner.hasNext() is true:

    • Check whether the map contains the word; if it doesn't, initialize the count to zero
    • Create a new entry with the key=word and the value=count+1
    • Update the map
  4. At the end of the iteration, the map is populated with all the values.

My code is as follows:

val wordMap = Map.empty[String,Int]
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  val currentCount = wordMap.getOrElse(token,0) + 1
  val wordMap = wordMap + (token,currentCount)
}

The idea is that wordMap will have all the word counts at the end of the iteration. Whenever I try to run this snippet, I get the following exception:

recursive value wordMap needs type.

Can somebody point out why I am getting this exception and what can I do to remedy it?

Thanks

om-nom-nom
sc_ray

2 Answers

val wordMap = wordMap + (token,currentCount)

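This line declares a new value named wordMap instead of updating the existing one, and the new value refers to itself in its own definition, which is why the compiler complains. If you want to rebind the name, you need to define wordMap with var and then just use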

wordMap = wordMap + (token -> currentCount)
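
As a rough sketch of the var-based fix, the loop below runs over an in-memory iterator of tokens instead of a Scanner, so it can be pasted into the REPL as is. The object name VarWordCount and the sample tokens are made up for illustration; they are not from the question.

```scala
object VarWordCount {
  // Counts tokens by rebinding a `var` to a new immutable Map on every step.
  def count(tokens: Iterator[String]): Map[String, Int] = {
    var wordMap = Map.empty[String, Int]
    while (tokens.hasNext) {
      val token = tokens.next()
      val currentCount = wordMap.getOrElse(token, 0) + 1
      // Writing `val wordMap = wordMap + ...` here would declare a new local value
      // whose initializer refers to itself: "recursive value wordMap needs type".
      wordMap = wordMap + (token -> currentCount)
    }
    wordMap
  }

  def main(args: Array[String]): Unit =
    println(VarWordCount.count(Iterator("a", "b", "c", "b")))
}
```

The map itself stays immutable throughout; only the name wordMap is rebound to a new map on each iteration.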

Though how about this instead?:

io.Source.fromFile("textfile.txt")            // read from the file
  .getLines.flatMap{ line =>                  // for each line
     line.split("\\s+")                       // split the line into tokens
       .groupBy(identity).mapValues(_.size)   // count each token in the line
  }                                           // this produces an iterator of token counts
  .toStream                                   // make a Stream so we can groupBy
  .groupBy(_._1).mapValues(_.map(_._2).sum)   // combine all the per-line counts
  .toList

Note that the per-line pre-aggregation is an attempt to reduce the memory required: grouping every token in the file at once might need too much memory.

If your file is really massive, I would suggest doing this in parallel (word counting is trivial to parallelize), using either Scala's parallel collections or Hadoop (via one of the cool Scala Hadoop wrappers like Scrunch or Scoobi).

EDIT: Detailed explanation:

Ok, first look at the inner part of the flatMap. We take a string, and split it apart on whitespace:

val line = "a b c b"
val tokens = line.split("\\s+") // Array(a, b, c, b)

Now identity is a function that just returns its argument, so if we groupBy(identity), we map each distinct word type to all of its word tokens:

val grouped = tokens.groupBy(identity) // Map(c -> Array(c), a -> Array(a), b -> Array(b, b))

And finally, we want to count up the number of tokens for each type:

val counts = grouped.mapValues(_.size) // Map(c -> 1, a -> 1, b -> 2)

Since we map this over all the lines in the file, we end up with token counts for each line.

So what does flatMap do? Well, it runs the token-counting function over each line, and then combines all the results into one big collection.

Assume the file is:

a b c b
b c d d d
e f c

Then we get:

val countsByLine = 
  io.Source.fromFile("textfile.txt")            // read from the file
    .getLines.flatMap{ line =>                  // for each line
       line.split("\\s+")                       // split the line into tokens
         .groupBy(identity).mapValues(_.size)   // count each token in the line
    }                                           // this produces an iterator of token counts
println(countsByLine.toList) // List((c,1), (a,1), (b,2), (c,1), (d,3), (b,1), (c,1), (e,1), (f,1))

So now we need to combine the counts of each line into one big set of counts. The countsByLine variable is an Iterator, so it doesn't have a groupBy method. Instead we can convert it to a Stream, which is basically a lazy list. We want laziness because we don't want to have to read the entire file into memory before we start. (An Iterator can only be traversed once, so if you already printed countsByLine.toList as above, recreate countsByLine before continuing.) Then the groupBy groups all counts of the same word type together.

val groupedCounts = countsByLine.toStream.groupBy(_._1)
println(groupedCounts.mapValues(_.toList)) // Map(e -> List((e,1)), f -> List((f,1)), a -> List((a,1)), b -> List((b,2), (b,1)), c -> List((c,1), (c,1), (c,1)), d -> List((d,3)))

And finally we can sum up the counts from each line for each word type by grabbing the second item from each tuple (the count), and summing:

val totalCounts = groupedCounts.mapValues(_.map(_._2).sum)
println(totalCounts.toList) // List((e,1), (f,1), (a,1), (b,3), (c,3), (d,3))

And there you have it.
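
To check the walkthrough without a file on disk, here is a minimal in-memory sketch of the same two-stage count. The object and method names (PerLineCounts, countLine, countAll) are invented for illustration. It also differs from the snippet above in two ways: it uses toList instead of toStream, on the assumption that the input is small, and it uses map instead of mapValues, which Scala 2.13 deprecates on Map.

```scala
object PerLineCounts {
  // Stage 1: count each token within a single line.
  def countLine(line: String): Map[String, Int] =
    line.split("\\s+")
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.length }

  // Stage 2: flatten the per-line counts into (word, count) pairs,
  // group the pairs by word, and sum the counts for each word.
  def countAll(lines: Iterator[String]): Map[String, Int] =
    lines.flatMap(countLine)
      .toList
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // The three sample lines from the explanation above.
    val sample = Iterator("a b c b", "b c d d d", "e f c")
    println(countAll(sample))
  }
}
```

With the three sample lines, the totals agree with the list printed above; only the order of the entries may differ.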

dhg
  • Another pretty mature snippet from you. I am still trying to work through the basics of scala but it is good to know some of the capabilities of scala. Thanks for the tips. – sc_ray Apr 16 '12 at 00:20
  • I actually was going through your snippet. It does the word counting like a charm, but I am having trouble understanding some of the subtleties of what you have up there. Like why do we need flatMap? What does groupBy(identity).mapValues(_.size) do? How does the toStream work? Does io.Source.fromFile("textfile.txt").getLines return a collection of all the lines at the same time? How does the .groupBy(_._1).mapValues(_.map(_._2).sum) work? In short, some of the syntactical pyrotechnics have left me a little befuddled, and it would be great to understand the nitty-gritty. – sc_ray Apr 16 '12 at 01:25
  • @sc_ray, I'll break it down a bit for you. Give me a few minutes. – dhg Apr 16 '12 at 01:27
  • @sc_ray, Detailed explanation added. – dhg Apr 16 '12 at 01:56
  • Excellent! Plugging the statements in the REPL and reading your explanations really made a humongous difference. Thanks a lot! – sc_ray Apr 16 '12 at 02:10

You have a few mistakes: you've defined wordMap twice (val declares a new value; it does not reassign an existing one). Also, Map is immutable, so you either have to declare it as a var or use a mutable map (I suggest the former).

Try this:

var wordMap = Map.empty[String,Int] withDefaultValue 0
val input = new java.util.Scanner(new java.io.File("textfile.txt"))
while(input.hasNext()){
  val token = input.next()
  wordMap += token -> (wordMap(token) + 1)
}
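
A further variant, offered only as a sketch and not part of the answer above: if you would rather avoid var as well, the same loop can be written as a fold that threads the immutable map through each step. The name FoldWordCount and the sample tokens are hypothetical; in practice the tokens would come from your Scanner or from the lines of the file.

```scala
object FoldWordCount {
  // Each step takes the map built so far and returns a new map
  // with the current token's count increased by one.
  def count(tokens: Iterator[String]): Map[String, Int] =
    tokens.foldLeft(Map.empty[String, Int]) { (counts, token) =>
      counts.updated(token, counts.getOrElse(token, 0) + 1)
    }

  def main(args: Array[String]): Unit =
    println(FoldWordCount.count(Iterator("a", "b", "c", "b")))
}
```

No name is ever rebound here, so neither var nor a mutable map is needed.
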
Luigi Plinge