I have a Spark cluster and HDFS running on the same machines. I copied a single text file of about 3 GB to each machine's local filesystem and to HDFS.
I have a simple word-count PySpark program.
If I submit the program reading the file from the local filesystem, it takes about 33 seconds. If I submit it reading the file from HDFS, it takes about 46 seconds.
Why? I expected exactly the opposite result.
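In case it is relevant, here is a quick sketch of how the partition counts of the two reads can be compared (same master URL and paths as in the script below; the app name here is just illustrative). Different partition counts would mean the two jobs run with different parallelism:

    # Sketch: compare how many partitions each read produces.
    # getNumPartitions() is available on RDDs in Spark 1.5.
    from pyspark import SparkContext

    sc = SparkContext('spark://192.168.2.11:7077', 'Partition check')
    local_rdd = sc.textFile('/root/test2')
    hdfs_rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')
    print 'local partitions:', local_rdd.getNumPartitions()
    print 'hdfs partitions:', hdfs_rdd.getNumPartitions()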
Added after sgvd's request:
16 slaves, 1 master
Spark Standalone with no particular settings (HDFS replication factor 3)
Spark version 1.5.2
import sys
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')

import os
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['JAVA_HOME'] = '/usr/local/java'

from pyspark import SparkContext
#conf = pyspark.SparkConf().set<conf settings>

# Read either from the local filesystem or from HDFS, depending on argv[1].
if sys.argv[1] == 'local':
    print 'Running in local-file mode'
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    print 'Running in HDFS mode'
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')

# Classic word count: split on spaces, count each word, then take the top five.
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print topFive
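Continuing from rdd1 in the script above, here is a minimal sketch of timing just the action, to separate the read and shuffle time from driver startup (the time wrapping is illustrative, not how the figures above were taken):

    import time

    start = time.time()
    topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])  # action: triggers the whole job
    print 'Elapsed: %.1f sec' % (time.time() - start)
    print topFive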