
I have code similar to what follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}

val fileContent = sc.textFile("file:///myfile")

val dataset = fileContent.map(row => {
    val explodedRow = row.split(",").map(s => s.toDouble)
    new LabeledPoint(explodedRow(13), Vectors.dense(
        Array(explodedRow(10), explodedRow(11), explodedRow(12))))
})

val algo = new LassoWithSGD().setIntercept(true)

val lambda = 0.0
algo.optimizer.setRegParam(lambda)
algo.optimizer.setNumIterations(100)
algo.optimizer.setStepSize(1.0)

val model = algo.run(dataset)

I'm running this in the cloud on my virtual server with 20 cores. The file is a "local" (i.e. not in HDFS) file with a few million rows. I run this in local mode, with sbt run (i.e. I don't use a cluster, I don't use spark-submit).

I would have expected this to get increasingly faster as I increase the spark.master=local[*] setting from local[8] to local[40]. Instead, it takes the same amount of time regardless of the setting (though I notice from the Spark UI that the maximum number of Active Tasks on my executor at any given time matches the expected value, i.e. ~8 for local[8], ~40 for local[40], etc. -- so the parallelization itself seems to work).

By default the number of partitions of my dataset RDD is 4. I tried forcing the number of partitions to 20, without success -- in fact it slows the Lasso algorithm down even more...

Is my expectation of the scaling process incorrect? Can somebody help me troubleshoot this?


1 Answer

Is my expectation of the scaling process incorrect?

Well, kind of. I hope you don't mind that I use a little bit of Python to prove my point.

  1. Let's be generous and say a few million rows is actually ten million. With 50 000 000 values (intercept + 3 features + label per row, i.e. 5 doubles per row) it gives around 380 MB of data (10M rows × 5 values × 8 bytes, since a Java Double is a double-precision 64-bit IEEE 754 floating point). Let's create some dummy data:

    import numpy as np
    
    n = 10 * 1000**2
    X = np.random.uniform(size=(n, 4))  # Features
    y = np.random.uniform(size=(n, 1))  # Labels
    theta = np.random.uniform(size=(4, 1))  # Estimated parameters
    
  2. Each step of gradient descent (since the default miniBatchFraction for LassoWithSGD is 1.0, it is not really stochastic), ignoring regularization, requires an operation like this, which computes the least-squares gradient X^T (X * theta - y):

    def step(X, y, theta):
        return ((X.dot(theta) - y) * X).sum(0)
    

    So let's see how long it takes locally on our data:

    %timeit -n 15 step(X, y, theta)
    ## 15 loops, best of 3: 743 ms per loop
    

    Less than a second per step, without any additional optimizations. Intuitively this is pretty fast and it won't be easy to match. Just for fun, let's see how long it takes to get the closed-form solution for data like this:

    %timeit -n 15 np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
    ## 15 loops, best of 3: 1.33 s per loop
    
  3. Now let's go back to Spark. Residuals for a single point can be computed in parallel, so this is the part which scales linearly as you increase the number of partitions processed in parallel.

    The problem is that after each step you have to aggregate data locally, serialize it, transfer it to the driver, deserialize and reduce it locally to get the final result. Then you have to compute the new theta, serialize it, send it back, and so on (see the sketch after this list).

    All of that can be improved by proper usage of mini batches and some further optimizations, but at the end of the day you are limited by the latency of the whole system. It is worth noting that when you increase parallelism on the worker side you also increase the amount of work that has to be performed sequentially on the driver, and the other way round. One way or another, Amdahl's law will bite you.

    Also, all of the above ignores the actual implementation.

    Now let's perform another experiment. First, some dummy data:

    nCores = 8  # Number of cores on local machine I use for tests
    rdd = sc.parallelize([], nCores)
    

    and a benchmark:

    %timeit -n 40 rdd.mapPartitions(lambda x: x).count()
    ## 40 loops, best of 3: 82.3 ms per loop
    

    It means that with 8 cores, without any real processing or network traffic, we already reach the point where we cannot do much better by increasing parallelism in Spark (743 ms / 8 = 92.875 ms per partition, assuming linear scalability of the parallelized part).
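
To make that per-iteration cycle concrete, here is a rough PySpark sketch of a single gradient-descent step -- a simplified illustration, not MLlib's actual implementation, assuming points is an RDD of (features, label) pairs with NumPy arrays as features:

    import numpy as np

    def gradient_step(points, theta, step_size):
        # Parallel part: partial gradients are computed inside each partition.
        grad, count = points.map(
            lambda p: (p[0] * (p[0].dot(theta) - p[1]), 1)
        ).reduce(
            lambda a, b: (a[0] + b[0], a[1] + b[1])
        )
        # Sequential part: the reduced gradient lands on the driver; theta is
        # updated there and shipped back to the workers with the next closure.
        return theta - step_size * grad / count

No matter how many partitions handle the map side in parallel, every iteration still ends with a reduce to the driver and a transfer of the new theta back to the workers -- that is the sequential fraction Amdahl's law punishes.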

Just to summarize the above:

  • if the data can be easily processed locally with a closed-form solution, using gradient descent is just a waste of time. If you want to increase parallelism / reduce latency you can use a good linear algebra library (see the sketch after this list)
  • Spark is designed to handle large amounts of data, not to reduce latency. If your data fits in the memory of a few-years-old smartphone, it is a good sign that Spark is not the right tool
  • if computations are cheap, then constant costs become the limiting factor
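
For completeness, a minimal sketch of the local, closed-form route for the lambda = 0 case from the question (with no regularization Lasso reduces to ordinary least squares, which np.linalg.lstsq hands off to LAPACK); the shapes mirror the dummy data above:

    import numpy as np

    n = 10 * 1000**2
    X = np.random.uniform(size=(n, 3))   # 3 features, as in the question
    X = np.hstack([np.ones((n, 1)), X])  # column of ones for the intercept
    y = np.random.uniform(size=n)        # labels

    # Closed-form least-squares fit, delegated to LAPACK.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)

For lambda > 0, a coordinate-descent implementation such as scikit-learn's Lasso plays the same role.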

Side notes:

  • a relatively large number of cores per machine is, generally speaking, not the best choice unless you can match it with IO throughput