
Wanted some insights on Spark execution on standalone and YARN. We have a 4-node Cloudera cluster, and currently the performance of our application when running in YARN mode is less than half of what we get when running in standalone mode. Does anyone have an idea of the factors that might be contributing to this?
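To make the comparison concrete, here is a minimal sketch of the kind of setup involved, with executor sizing pinned explicitly so that only the master URL differs between the two runs. The host name and resource figures are placeholders rather than the actual values, and in practice these settings are usually passed to spark-submit rather than hard-coded:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: executor sizing is fixed so that only the master URL
    // changes between the two runs. Host name and resource figures below are
    // placeholders, not the actual values.
    object MasterComparison {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("standalone-vs-yarn-comparison")
          .master("spark://master-host:7077")      // run 1: standalone master (placeholder host)
          // .master("yarn")                       // run 2: YARN (needs HADOOP_CONF_DIR on the client)
          .config("spark.executor.memory", "4g")   // per-executor memory, identical in both runs
          .config("spark.executor.cores", "2")     // cores per executor
          .config("spark.cores.max", "8")          // total-core cap honoured by standalone mode
          .config("spark.executor.instances", "4") // executor count used by YARN
          .getOrCreate()

        // ... the actual streaming job would run here ...
        spark.stop()
      }
    }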

Sumit Khurana
  • Is that on the same set of machines, and your spark-submit is run with exactly the same parameters for the number of executors, the max memory, the number of cores, etc.? – ernest_k Apr 12 '18 at 10:09
  • @ErnestKiwele yes the only difference is the --master parameter. – Sumit Khurana Apr 12 '18 at 10:11
  • Interesting. Can you post the detailed job/stage/task execution duration as per the respective UIs (screenshots)? Also, are you sure the difference is not overhead-related (if applicable)... I wouldn't be worried about a 4sec/10sec execution time difference, but would be worried about a 4min/10min difference (deployment overhead?) – ernest_k Apr 12 '18 at 10:18
  • @ErnestKiwele nothing of that sort; I will be able to share the screenshots tomorrow – Sumit Khurana Apr 12 '18 at 11:34
  • How *BIG* is your data set and how long does it take? – tk421 Apr 12 '18 at 20:54
  • @tk421 we are using Spark Streaming to process JSON messages (approx. 8K each), performing some validation and writing them to MQ. – Sumit Khurana Apr 12 '18 at 23:49
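For illustration, a rough sketch of the kind of pipeline described in that last comment (read small JSON messages, validate them, write the survivors out). The actual input source and the MQ writer are not specified, so a socket source and a console sink stand in for them here:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{StringType, StructType}

    // Illustrative stand-in for the described job: read small JSON messages,
    // validate them, and write the survivors out. The socket source and the
    // console sink are placeholders for the real input and the MQ sink.
    object JsonValidateStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("json-validate-stream").getOrCreate()
        import spark.implicits._

        // Placeholder schema; the real message layout is not given in the question.
        val schema = new StructType()
          .add("id", StringType)
          .add("payload", StringType)

        val raw = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // Parse each JSON message (roughly 8K each, per the comment above) and keep
        // only records that pass a simple validation rule (a non-null id, purely
        // for illustration).
        val validated = raw
          .select(from_json($"value", schema).as("msg"))
          .select("msg.*")
          .filter($"id".isNotNull)

        validated.writeStream
          .format("console") // placeholder for the MQ writer in the real job
          .start()
          .awaitTermination()
      }
    }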

1 Answer


Basically, your data and cluster are too small.

Big Data technologies are really meant to handle data that cannot fit on a single system. Given that your cluster has only 4 nodes, it might be fine for POC work, but you should not consider it acceptable for benchmarking your application.

To give you a frame of reference, Hortonworks's article BENCHMARK: SUB-SECOND ANALYTICS WITH APACHE HIVE AND DRUID uses a cluster of:

  • 10 nodes
  • 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz with 16 CPU threads each
  • 256 GB RAM per node
  • 6x WDC WD4000FYYZ-0 1K02 4TB SCSI disks per node

This works out to 320 CPU threads (10 nodes × 2 CPUs × 16 threads), 2,560 GB of RAM (10 × 256 GB), and 240 TB of disk (10 × 6 × 4 TB).

Another benchmark, from Cloudera's article New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance, uses a 21-node cluster where each node has:

  • CPU: 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
  • 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
  • 384GB memory

This works out to 504 CPU threads (21 nodes × 12 cores × 2 hyper-threads), 8,064 GB of RAM (21 × 384 GB), and roughly 231 TB of disk (about 11 TB per node × 21).

This should give you an idea of the scale required before a system can be considered reliable for benchmarking purposes.

tk421