I have code similar to what follows:
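    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}

    // sc is an existing SparkContext
    val fileContent = sc.textFile("file:///myfile")

    // Parse each CSV row: column 13 is the label, columns 10 to 12 are the features
    val dataset = fileContent.map { row =>
      val explodedRow = row.split(",").map(s => s.toDouble)
      new LabeledPoint(explodedRow(13), Vectors.dense(
        Array(explodedRow(10), explodedRow(11), explodedRow(12))
      ))
    }

    val algo = new LassoWithSGD().setIntercept(true)
    val lambda = 0.0
    algo.optimizer.setRegParam(lambda)
    algo.optimizer.setNumIterations(100)
    algo.optimizer.setStepSize(1.0)

    val model = algo.run(dataset)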
I'm running this in the cloud on my virtual server with 20 cores. The file is a "local" (i.e. not in HDFS) file with a few million rows. I run this in local mode, with sbt run (i.e. I don't use a cluster, I don't use spark-submit).
I would have expected this to get increasingly faster as I raise the spark.master setting from local[8] to local[40]. Instead, it takes the same amount of time regardless of which setting I use. The Spark UI does show that the maximum number of Active Tasks on my executor at any given time matches the setting (i.e. ~8 for local[8], ~40 for local[40], etc.), so the parallelization itself seems to work.
By default, the number of partitions of my dataset RDD is 4. I tried forcing the number of partitions to 20, without success: in fact, it slows the Lasso algorithm down even more.
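For reference, here is a minimal sketch of how the partition count can be forced and checked. It is not necessarily the exact call used in my code: the object name `PartitionCheck`, the helper `partitionCounts`, the app name, and the synthetic `parallelize` data (a stand-in for the parsed file, so the snippet runs without it) are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  // Hypothetical helper: returns (partitions before, partitions after repartitioning).
  def partitionCounts(sc: SparkContext, target: Int): (Int, Int) = {
    // Stand-in for the parsed file: 1000 synthetic rows spread over 4 partitions.
    val rows = sc.parallelize(1 to 1000, 4)
    // repartition shuffles the data into the requested number of partitions.
    val repartitioned = rows.repartition(target)
    (rows.partitions.length, repartitioned.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder app name; local[20] matches the 20 cores of the server.
    val conf = new SparkConf().setAppName("partition-check").setMaster("local[20]")
    val sc = new SparkContext(conf)
    try {
      println(s"default parallelism: ${sc.defaultParallelism}")
      val (before, after) = partitionCounts(sc, 20)
      println(s"partitions before: $before, after: $after")
    } finally {
      sc.stop()
    }
  }
}
```

With the real file, `sc.textFile("file:///myfile", 20)` asks for at least 20 partitions at read time, which avoids the shuffle that `repartition` performs.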
Is my expectation of the scaling process incorrect? Can somebody help me troubleshoot this?