We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below: it takes Spark nearly 2 seconds to load a ten-line text file from HDFS and count its lines. My questions:
- Is this expected? How long does it take on your platform?
- Any ideas why? I'm currently using Spark 1.3 on a two-node Hadoop cluster (each node has 8 cores and 64 GB RAM). I'm pretty green when it comes to Hadoop and Spark, so I've done little configuration beyond the Ambari/HDP defaults.
Initially I was testing on a hundred million rows, and Spark was taking about 10 minutes simply to count them.
Example:
Create text file of 10 numbers, and load it into hadoop:
for i in {1..10}; do echo $i >> numbers.txt; done
hadoop fs -put numbers.txt numbers.txt
Start pyspark (which takes about 20 seconds ...):
pyspark --master yarn-client --executor-memory 4G --executor-cores 1 --driver-memory 4G --conf spark.python.worker.memory=4G
Load the file from HDFS and count it:
sc.textFile('numbers.txt').count()
According to the console output, Spark takes around 1.6 seconds to run that count. Even with terrible configuration, I wouldn't expect it to take that long.
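For reference, here's a rough way to time the count from within the pyspark shell, separating one-off executor start-up from the job itself (a minimal sketch using the standard time module; sc is the SparkContext the shell provides):

import time

# numbers.txt is the ten-line file created above
rdd = sc.textFile('numbers.txt')

# check how many partitions the file is split into
print('partitions: %d' % rdd.getNumPartitions())

# first count: includes one-off costs such as launching executors on YARN
start = time.time()
rdd.count()
print('first count: %.2fs' % (time.time() - start))

# second count: executors are already up, so this is closer to the raw job time
start = time.time()
rdd.count()
print('second count: %.2fs' % (time.time() - start))

If the second count comes back much faster than the first, then presumably most of the 1.6 seconds is start-up overhead rather than the count itself.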