I'd like to find a good, robust MapReduce framework to use from Scala.
8 Answers
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
UPD 5 Oct '11
There is also the Scoobi framework, which is impressively expressive.

SHadoop is quite old--it uses the old MR framework. I updated the implicits at some point: https://github.com/schmmd/Hadoop-Scala-Commons – schmmd Dec 08 '11 at 22:57
Personally, I've become a big fan of Spark. It gives you the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive MapReduce operations.
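To make the contrast concrete, here is a minimal word-count sketch against Spark's RDD API (the classic MapReduce example). The input and output paths are hypothetical placeholders; this assumes a Spark dependency on the classpath and a local master for testing.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations build the lineage lazily; nothing executes
    // until an action (saveAsTextFile) is called.
    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // shuffled in memory, no per-stage HDFS writes

    counts.saveAsTextFile("counts")         // hypothetical output path
    sc.stop()
  }
}
```

The key difference from classic Hadoop MapReduce is that intermediate results between stages stay in memory rather than being spilled to HDFS between each map and reduce phase.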

http://hadoop.apache.org/ is language-agnostic.

I'm sorry, but I didn't ask for a Java implementation. Hadoop can indeed be plugged into Scala, but the boilerplate code has to be written in Java. – Roman Kagan Jun 08 '09 at 03:26
Write a ScalaHadoopAdapter which takes care of all the boilerplate and publish it as free/open-source? – yfeldblum Jun 12 '09 at 04:39
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.

For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new Scala wrapper for Cascading from Twitter, called Scalding. After looking very briefly over the Scalding documentation, it seems that while it makes the integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, e.g. joining on a key as a String on one side and as a Long on the other, lead to run-time failures.
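The type-safety issue can be illustrated in plain Scala, without Cascading's actual API. Untyped, field-based tuples behave roughly like `Map[String, Any]`: a join on mismatched key types compiles, but silently matches nothing (or fails) at run time, whereas typed records let the compiler catch the mismatch. The record shapes below are invented for illustration only.

```scala
// Untyped, Cascading-style tuples: field values are just Any, so a
// String key on one side and a Long key on the other only collide
// at run time.
val users: List[Map[String, Any]]  = List(Map("id" -> "42", "name" -> "ada"))
val orders: List[Map[String, Any]] = List(Map("id" -> 42L, "item" -> "book"))

// Compiles fine, but "42" != 42L, so the join silently matches nothing.
val untypedJoin = for {
  u <- users; o <- orders if u("id") == o("id")
} yield (u("name"), o("item"))
// untypedJoin is empty

// With typed records, a String/Long key mismatch is a compile error,
// not a silent run-time bug.
case class User(id: Long, name: String)
case class Order(id: Long, item: String)
val typedJoin = for {
  u <- List(User(42L, "ada")); o <- List(Order(42L, "book")) if u.id == o.id
} yield (u.name, o.item)
// typedJoin contains ("ada", "book")
```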

Scalding does have a type-safe API: https://github.com/twitter/scalding/wiki/Type-safe-api-reference and in the Fields API (which you are mentioning), joining a string to a long doesn't cause run-time exceptions (if they are both numbers). Of course, in the type-safe API such a join is prohibited by the compiler. – Oscar Boykin Feb 20 '13 at 05:56
To further jshen's point:
Hadoop Streaming simply uses Unix pipes: your code (in any language) just has to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure that as the combiner as well).
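A minimal Streaming-style mapper is just a stdin/stdout filter. The sketch below emits tab-delimited `word<TAB>1` pairs for word count; Hadoop would sort these by key before feeding them to the reducer process. The object name is an arbitrary choice.

```scala
// Minimal Hadoop-Streaming-style mapper: read lines from stdin,
// emit one tab-delimited (word, 1) pair per word on stdout.
object StreamingMapper {
  def main(args: Array[String]): Unit = {
    for {
      line <- scala.io.Source.stdin.getLines()
      word <- line.split("\\s+") if word.nonEmpty
    } println(s"$word\t1")
  }
}
```

You would wire this (and a matching reducer) into a job with the `hadoop-streaming` jar, passing the compiled program via the `-mapper` and `-reducer` options.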

I've added a MapReduce implementation using Hadoop, with a few test cases, on GitHub: https://github.com/sauravsahu02/MapReduceUsingScala. Hope that helps. Note that the application is already tested.
