
The paper "Making Sense of Performance in Data Analytics Frameworks", published at NSDI 2015, concludes that CPU (not I/O or network) is the performance bottleneck of Spark. In the paper, Kay ran experiments on Spark using BDbench, TPC-DS, and a production workload (is only Spark SQL used?). I wonder whether this conclusion also holds for other frameworks built on Spark, such as Streaming, where a continuous data stream is received over the network and both network I/O and disk are under high pressure.

Xingjun Wang

2 Answers


It really depends on the job you execute. You need to analyze the job you write and see where the pressure and bottlenecks are. For instance, I recently had a job that didn't have enough memory on the workers, so it had to spill to disk, which increased its overall I/O by a lot. Once I fixed the memory problem, CPU became the next bottleneck; tightening the code then moved the bottleneck to I/O, and so on.
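As a rough illustration of that kind of analysis, here is a minimal Python sketch that attributes task time to CPU, disk, or network and reports the largest share. It assumes you have already collected per-task timings somewhere; the `TaskMetrics` class, its field names, and `dominant_bottleneck` are hypothetical and are not Spark APIs or Spark metric names.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    # Hypothetical per-task timings in milliseconds. These field names are
    # illustrative only; they are not Spark's own metric names.
    run_ms: float
    disk_blocked_ms: float
    network_blocked_ms: float


def dominant_bottleneck(tasks):
    """Return (resource, shares), where shares maps each resource to its
    fraction of total task time across all tasks."""
    total = sum(t.run_ms for t in tasks)
    if total <= 0:
        raise ValueError("no task time recorded")
    disk = sum(t.disk_blocked_ms for t in tasks)
    network = sum(t.network_blocked_ms for t in tasks)
    # Assumption: time not spent blocked on disk or network counts as CPU.
    cpu = total - disk - network
    shares = {
        "cpu": cpu / total,
        "disk": disk / total,
        "network": network / total,
    }
    return max(shares, key=shares.get), shares


# Made-up numbers: a job that spills to disk, then the same job after
# the memory problem is fixed.
spilling = [TaskMetrics(1000, 600, 100), TaskMetrics(1000, 500, 100)]
fixed = [TaskMetrics(1000, 100, 100), TaskMetrics(1000, 100, 100)]
print(dominant_bottleneck(spilling)[0])
print(dominant_bottleneck(fixed)[0])
```

The numbers are invented to mirror the story above: with heavy spilling the disk share dominates, and once the spill is gone the remaining time is mostly CPU.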

Arnon Rotem-Gal-Oz
  • Thanks for your answer. You're right that it depends on the actual workload. But I'm just wondering whether this conclusion is right, since the paper presents it as a general one. – Xingjun Wang May 18 '15 at 02:11

Network and disk may be under less pressure in Spark Streaming because streams are usually checkpointed, meaning that not all data is kept around forever.

But ultimately, this is a research question: the only way to settle it is to benchmark. Kay's code is open source.

Francois G
  • Thanks for your attention. But Kay's experiments are mostly based on Spark SQL, which differs from other frameworks in some respects (though they share the same Spark core). I just wonder how the paper can draw this conclusion about Spark in general (not just Spark SQL). I'll look into this problem, thanks again! – Xingjun Wang May 18 '15 at 01:40