
We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below - it's taking Spark nearly 2 seconds to load in a text file with ten rows from HDFS, and count the number of lines. My questions:

  1. Is this expected? How long does it take your platform?
  2. Any possible ideas why? Currently I'm using Spark 1.3 on a two-node Hadoop cluster (each node has 8 cores and 64 GB RAM). I'm pretty green when it comes to Hadoop and Spark, so I've done little configuration beyond the Ambari/HDP defaults.

Initially I was testing on a hundred million rows - Spark was taking about 10 minutes to simply count it.

Example:

Create text file of 10 numbers, and load it into hadoop:

for i in {1..10}; do echo $i >> numbers.txt; done
hadoop fs -put numbers.txt numbers.txt

Start pyspark (which takes about 20 seconds ...):

pyspark --master yarn-client --executor-memory 4G --executor-cores 1 --driver-memory 4G --conf spark.python.worker.memory=4G

Load the file from HDFS and count it:

sc.textFile('numbers.txt').count()

According to the timing output, it takes Spark around 1.6 seconds to do that. Even with terrible configuration, I wouldn't expect it to take that long.

  • Does that still happen the second time you run the count command? Spark takes a few seconds to do all its initialization and class loading. – sk. Nov 24 '15 at 05:54
  • It is considerably quicker a second time round (< 0.1s), though I was under the impression that was due to caching? Is there a way I can force Spark to initialize first, so I can test it post-initialization? – Nov 24 '15 at 07:36
  • Try creating numbers2.txt? :) – sk. Nov 25 '15 at 03:25
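
One way to try the suggestion in the last comment, sketched here on the assumption that a second file numbers2.txt has been uploaded the same way as numbers.txt and that this runs inside the pyspark shell (where sc already exists): warm the session up with a first action, then time a count against a file Spark has not touched yet.

import time

# Warm-up: the first action pays the initialization and class-loading cost
sc.textFile('numbers.txt').count()

# Time a count against a file Spark has not seen before, so caching cannot help
start = time.time()
n = sc.textFile('numbers2.txt').count()
print('counted %d lines in %.3f seconds' % (n, time.time() - start))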

2 Answers


This is definitely too slow (on my local machine it takes about 0.3 seconds), even for a bad Spark configuration (and the default Spark configuration usually works for most normal use). Maybe you should double-check your HDFS configuration or network-related configuration.
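
A rough way to check that (not part of the original answer, and it assumes the hadoop client is on the PATH): time a plain HDFS read of the same file from Python and compare it with the Spark count. If this is also slow, the problem is in HDFS or the network rather than in Spark itself.

import subprocess
import time

# Read the file through the plain HDFS client, bypassing Spark entirely
start = time.time()
data = subprocess.check_output(['hadoop', 'fs', '-cat', 'numbers.txt'])
elapsed = time.time() - start

print('read %d bytes from HDFS in %.3f seconds' % (len(data), elapsed))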

tiny sunlight

It has nothing to do with cluster configuration. It is due to lazy evaluation.

There are two types of operations in the Spark API: transformations and actions.

Have a look at this explanation from the Spark documentation:

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

In sc.textFile('numbers.txt').count(), textFile() is a lazy transformation and count() is the action that actually triggers the job.
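
A small sketch of the same idea, assuming it runs in the pyspark shell (so sc already exists); the map step is only there to show another lazy transformation:

import time

rdd = sc.textFile('numbers.txt')               # transformation: returns immediately, nothing is read yet
doubled = rdd.map(lambda line: int(line) * 2)  # another transformation: still lazy

start = time.time()
print('%d lines, first count took %.3f s' % (doubled.count(), time.time() - start))  # action: Spark only now runs a job

start = time.time()
doubled.count()                                # the same action again, once everything is initialized
print('second count took %.3f s' % (time.time() - start))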

That is why, even though it took about 2 seconds the first time, it took only a fraction of a second the second time.

Ravindra babu