I want to compare the computing time in Spark of training one model on a large dataset versus training separate models on splits of that dataset, using the same learning algorithm. I also want to obtain the per-partition model results.
However, my results show that the original (single-fit) approach is faster than the "parallel" method. I expected the runs on the split datasets to execute in parallel and therefore finish faster, but I do not know how to set that up.
How can I adjust the parameters to get what I want?
Alternatively, can I stop Spark from using partitions with the original method?
The original:
val lr = new LogisticRegression()
val lrModel = lr.fit(training)
The parallel:
val lr = new LogisticRegression()
// randomSplit expects Array[Double] weights; equal weights give equal-sized splits.
// numSplits is a placeholder -- the original code elided the count.
val numSplits = 10
val splits = training.randomSplit(Array.fill(numSplits)(1.0), 11L)
// Note: this for loop fits the models one after another, i.e. sequentially.
val lrModels = for (train <- splits) yield lr.fit(train)
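One likely cause of the timing result above: a plain for loop issues the `fit` calls one at a time, so the split version pays the scheduling overhead of many small jobs with no concurrency. Below is a minimal, non-Spark sketch of the difference (the names `ParallelFitSketch` and `fitOne` are made up, and `Thread.sleep` stands in for the cost of `lr.fit`): the sequential loop waits for each "fit" to finish, while wrapping each call in a `Future` submits them all at once. In real Spark, jobs submitted from separate threads can run concurrently if the cluster has free resources (see Spark's job scheduling and FAIR scheduler pools).

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelFitSketch {
  // Stand-in for lr.fit(part): each fake "fit" takes roughly 200 ms.
  def fitOne(id: Int): Int = { Thread.sleep(200); id }

  // Returns (sequential ms, concurrent ms) for four fake fits.
  def run(): (Double, Double) = {
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val parts = (1 to 4).toList

    // Sequential, like the for loop in the question: fits run one by one.
    val t0 = System.nanoTime()
    parts.map(fitOne)
    val seqMs = (System.nanoTime() - t0) / 1e6

    // Concurrent: every fit is submitted at once, then we wait for all.
    val t1 = System.nanoTime()
    val futures = parts.map(p => Future(fitOne(p)))
    futures.foreach(f => Await.result(f, Duration.Inf))
    val parMs = (System.nanoTime() - t1) / 1e6

    pool.shutdown()
    (seqMs, parMs)
  }

  def main(args: Array[String]): Unit = {
    val (seqMs, parMs) = run()
    println(f"sequential: $seqMs%.0f ms, concurrent: $parMs%.0f ms")
  }
}
```

The same pattern applies to the Spark snippet: mapping the splits to `Future { lr.fit(part) }` (instead of a for loop) would let the driver submit the fits concurrently, though whether they actually overlap depends on available executors.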