What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map
on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.