Why is Spark fast when word count?

Question

Test case: word counting in 6G data in 20+ seconds by Spark.

I understand MapReduce, FP and stream programming models, but couldn’t figure out the word counting is so amazing fast.

I think it’s an I/O intensive computing in this case, and it’s impossible to scan 6G files in 20+ seconds. I guess there is index is performed before word counting, like Lucene does. The magic should be in RDD (Resilient Distributed Datasets) design which I don’t understand well enough.

I appreciate if anyone could explain RDD for the word counting case. Thanks!

How many computers you used? – member555 Aug 30 '15 at 16:03 — member555, Aug 30 '15 at 16:03

score 4 · Accepted Answer · answered Mar 04 '15 at 10:53

First is startup time. Hadoop MapReduce job startup requires starting a number of separate JVMs which is not fast. Spark job startup (on existing Spark cluster) causes existing JVM to fork new task threads, which is times faster than starting JVM

Next, no indexing and no magic. 6GB file is stored in 47 blocks of 128MB each. Imagine you have a big enough Hadoop cluster that all of these 47 HDFS blocks are residing on different JBOD HDDs. Each of them would deliver you 70 MB/sec scan rate, which means you can read this data in ~2 seconds. With 10GbE network in your cluster you can transfer all of this data from one machine to another in just 7 seconds.

Lastly, Hadoop puts intermediate data to disks a number of times. It puts map output to the disk at least once (and more if the map output is big and on-disk merges happen). It puts the data to disks next time on reduce side before the reduce itself is executed. Spark puts the data to HDDs only once during the shuffle phase, and the reference Spark implementation recommends to increase the filesystem write cache not to make this 'shuffle' data hit the disks

All of this gives Spark a big performance boost compared to Hadoop. There is no magic in Spark RDDs related to this question

Do you mean there are 47 threads to handle the files? If so, the computer has 8 core-CPU, it is impossible to run 47 threads in 8 core-CPU concurrently, right? — 卢声远 Shengyuan Lu, Mar 06 '15 at 03:00
You mean the computations are done on a single machine? 8-core Xeon processes 16 parallel threads, if you have enough RAM all the file would be cached and read from memory (and even with disks, good raid gives comparable performance for sequential IO). — 0x0FFF, Mar 06 '15 at 04:47

score 0 · Answer 2 · answered Mar 04 '15 at 20:02

0

Other than the factors mentioned by 0x0FFF, local combining of results also makes spark run word count more efficiently. Spark, by default, combines results on each node before sending the results to other nodes.

In case of word count job, Spark calculates the count for each word on a node and then sends the results to other nodes. This reduces the amount of data to be transferred over network. To achieve the same functionality in Hadoop Map-reduce, you need to specify combiner class job.setCombinerClass(CustomCombiner.class)

By using combineByKey() in Spark, you can specify a custom combiner.

answered Mar 04 '15 at 20:02

Manmohan

178
1
6

AFAIK reference Hadoop WordCount implementation comes with combiner by default: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html – 0x0FFF Mar 05 '15 at 04:14
But he did not mention that he has used this Wordcount implementation for comparing with Spark performance. – Manmohan Mar 05 '15 at 16:45
It is typical for benchmarking: showing that toolA is better than toolB people usually show the best case for toolA versus the worst case for toolB. To be clear, noone will use wordcount in production without combiner – 0x0FFF Mar 05 '15 at 16:50
You have made multiple assumptions. He did not mention he used Wordcount for production program or like that. Combiner concept in spark is an important feature and somebody who is new to Spark may note be aware of it. – Manmohan Mar 05 '15 at 16:56

score 0 · Answer 3 · answered Dec 14 '15 at 11:12

Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or reduce action. But Spark needs a lot of memory
Spark loads a process into memory and keeps it there until further notice, for the sake of caching.
Resilient Distributed Dataset (RDD), which allows you to transparently store data on memory and persist it to disc if it's needed.
Since Spark uses in-memory, there's no synchronisation barrier that's slowing you down. This is a major reason for Spark's performance.
Rather than just processing a batch of stored data, as is the case with MapReduce, Spark can also manipulate data in real time using Spark Streaming.
The DataFrames API was inspired by data frames in R and Python (Pandas), but designed from the ground-up to as an extension to the existing RDD API.
A DataFrame is a distributed collection of data organized into named columns, but with richer optimizations under the hood that supports to the speed of spark.
Using RDDs Spark simplifies complex operations like join and groupBy and in the backend, you’re dealing with fragmented data. That fragmentation is what enables Spark to execute in parallel.
Spark allows to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. DAGs are a major part of Sparks speed.

Hope this helps.

Why is Spark fast when word count?

3 Answers3

Linked

Related