
I have a Spark cluster and an HDFS installation on the same machines. I've copied a single text file of about 3 GB onto each machine's local filesystem and into HDFS.

I have a simple word-count PySpark program.

If I submit the program reading the file from the local filesystem, it takes about 33 seconds. If I submit it reading the file from HDFS, it takes about 46 seconds.

Why? I expected exactly the opposite result.

Added after sgvd's request:

16 slaves, 1 master

Spark Standalone with no particular settings (replication factor 3)

Version 1.5.2

import sys
# Make the local Spark 1.5.2 installation importable before importing pyspark.
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
import os
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['JAVA_HOME'] = '/usr/local/java'
from pyspark import SparkContext
#conf = pyspark.SparkConf().set<conf settings>


# Pick the input source from the first command-line argument.
if sys.argv[1] == 'local':
    print('Running in local-file mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    print('Running in HDFS mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')


# Word count: split each line into words, count occurrences, take the top five.
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print(topFive)
arj
  • It can depend on many things. How big is your cluster? What cluster manager do you use? Any custom settings? What Spark version? Can you show your code? – sgvd Jan 13 '16 at 14:07
  • I've answered in the question above. – arj Jan 13 '16 at 15:02

3 Answers


It is a bit counter-intuitive, but since the replication factor is 3 and you have 16 nodes, each node stores on average about 20% of the data locally in HDFS. Approximately 6 worker nodes should then be sufficient, on average, to read the entire file without any network transfer.

If you record the running time versus the number of worker nodes, you should notice that beyond around 6 nodes there is no difference between reading from the local FS and reading from HDFS.

The above computation can be done with variables, e.g. x = number of worker nodes and y = replication factor. Since reading from the local FS requires the file to be present on every node, you effectively end up with x = y, and there will be no difference once more than floor(x/y) nodes are used. This is exactly what you are observing, and it seems counter-intuitive at first. Would you use a 100% replication factor in production?
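
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch using the numbers from the question (16 workers, replication factor 3); it is illustrative only, not measured data:

nodes = 16        # worker nodes, each of which is also an HDFS datanode
replication = 3   # HDFS replication factor

# Average fraction of the file's blocks that any single node stores locally.
local_fraction = float(replication) / nodes        # 3/16 = 0.1875, i.e. roughly 20%

# Roughly how many nodes are needed before every block can be read locally.
nodes_to_cover_file = nodes / float(replication)   # 16/3 ~= 5.3, i.e. about 6 nodes

print(local_fraction, nodes_to_cover_file)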

Radu Ionescu
  • Changing the rep factor but not the number of workers doesn't change the time. With 6 workers, rep factor 3 and 6 datanodes, the time increases up to 1 minute and 30 seconds. – arj Jan 14 '16 at 14:05
  • How did you configure this? Did you restart your cluster? You say in the description you have 16 slaves. – Radu Ionescu Jan 14 '16 at 14:48
  • Rep factor change attempt: I changed the rep factor of a file from 2 up to 16, with the program submitted to 16 slaves. Number-of-nodes attempt: I reconfigured the whole cluster (Spark and Hadoop) with only 6 nodes. – arj Jan 15 '16 at 07:43

What are the parameters specific to the Executor, the Driver and the RDD (with respect to spilling and storage level)?

From the Spark documentation:

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
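
To make the map-side aggregation concrete, here is a small illustrative sketch based on the question's word count (the pairs name is mine; the logic mirrors the question's code):

pairs = rdd.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1))

# Both operations named in the quoted passage build per-partition, in-memory
# aggregates before the shuffle.
counts_reduce = pairs.reduceByKey(lambda a, b: a + b)
counts_aggregate = pairs.aggregateByKey(0,
                                        lambda acc, v: acc + v,  # combine within a partition (map side)
                                        lambda a, b: a + b)      # merge partial sums after the shuffle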

I am interested in the memory/CPU core limits for the Spark job versus the memory/CPU core limits for the Map & Reduce tasks.

Key parameters to benchmark from Hadoop:

yarn.nodemanager.resource.cpu-vcores
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.reduce.shuffle.memory.limit.percent

Key Spark parameters to benchmark against the Hadoop ones for equivalence (a sketch of setting them follows the list):

spark.driver.memory
spark.driver.cores
spark.executor.memory
spark.executor.cores
spark.memory.fraction

These are just some of the key parameters. Have a look at the detailed sets from Spark and MapReduce.

Without the right set of parameters, we can't compare the performance of jobs across the two technologies.

Ravindra babu

It's because of how the data is distributed. A single text document isn't a good option; there are several better alternatives, such as Parquet. If you use one of them you will notice that performance improves noticeably, because the way the file is partitioned allows your Apache Spark cluster to read the parts in parallel.
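
For concreteness, here is a minimal sketch (the output path is assumed, and the Spark 1.5-era DataFrame API is used) of converting the question's text file into Parquet so that later jobs read a partitioned, columnar copy:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)   # reuses the SparkContext from the question

# One-off conversion: store the plain-text lines (the question's rdd) as Parquet.
lines_df = sqlContext.createDataFrame(rdd.map(lambda l: Row(line=l)))
lines_df.write.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')

# Subsequent jobs read the splittable, columnar copy in parallel.
lines = sqlContext.read.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')
words = lines.rdd.flatMap(lambda row: row.line.split(' '))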

Alberto Bonsanto