
I want to compare the computing time of training one model on a large dataset against training separate models on splits of that dataset in Spark, using the same learning algorithm. I also want to obtain the per-partition model results.

However, the results show that the original approach is faster than the "parallel" one. I expected training on the split datasets in parallel to be faster, but I do not know how to set it up.

How can I adjust the parameters to get what I want?

Alternatively, can I stop Spark from using partitions in the original method?

The original:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
val lrModel = lr.fit(training)

The parallel:

val lr = new LogisticRegression()
// split weights must be doubles; the "....." elision is kept from the original
val splits = training.randomSplit(Array(1.0, 1.0, /* ..... */ 1.0), 11L)
// this loop fits the models one after another
val lrModels = for (split <- splits) yield lr.fit(split)
Martin TT

1 Answer


The first snippet, the "original", is also parallelized. To understand why, have a look at Spark's execution model.

In the first example, Spark has one large dataset. Spark splits it into partitions and computes each partition in a separate task, each running on its own thread. In the second example, you split your data manually (internally, each split is of course divided into partitions as well). Then you invoke fit in a loop, so one model is computed, then the next one, and so on. The "parallel" example is therefore no more parallel than the first one, and I'm not surprised that the first code runs faster.
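You can check how many partitions Spark has already created for the "original" dataset (a minimal sketch; training is assumed to be your existing DataFrame):

// Each partition becomes a task that Spark can run on a separate thread/executor
println(training.rdd.getNumPartitions)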

In the first example you build one model; in the second you build several. Each individual model build is distributed across many threads, but each fit() in the second example starts only after the previous one has finished.
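If the goal really is to train the per-split models concurrently, one option (a sketch only, not code from the question) is to submit the fit() calls from several driver threads, e.g. with Scala parallel collections; Spark's scheduler can then run the resulting jobs concurrently, resources permitting. Note that .par works out of the box in Scala 2.11/2.12, while Scala 2.13 needs the scala-parallel-collections module.

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
// assumption: 4 equal splits, following the question's randomSplit idea
val splits = training.randomSplit(Array.fill(4)(1.0), 11L)
// .par submits each fit() from its own driver thread, so the Spark jobs
// can overlap instead of each waiting for the previous model to finish
val lrModels = splits.par.map(split => lr.fit(split)).seq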

You can stop parallelism via the repartition method with a parameter value of 1, but removing parallelism from the first example is not a solution. What you have shown is simply that an iterative approach is slower than a parallel one :)
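For completeness, a sketch of what repartition(1) looks like (this forces each stage to run as a single task, which is almost always slower):

// collapse the training data into a single partition before fitting;
// each stage then runs as one task, i.e. effectively serially
val lrModel = lr.fit(training.repartition(1))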

T. Gawęda