
I've recently been trying to get to know Apache Spark as a replacement for Scikit Learn; however, it seems to me that even in simple cases Scikit converges to an accurate model far faster than Spark does. For example, I generated 1000 data points for a very simple linear function (z = x + y) with the following script:

from random import random

def func(in_vals):
    '''result = x (+y+z+w....)'''
    result = 0
    for v in in_vals:
        result += v
    return result

if __name__ == "__main__":
    entry_count = 1000
    dim_count = 2
    in_vals = [0]*dim_count
    with open("data_yequalsx.csv", "w") as out_file:
        for entry in range(entry_count):
            for i in range(dim_count):
                in_vals[i] = random()
            out_val = func(in_vals)
            out_file.write(','.join([str(x) for x in in_vals]))
            out_file.write(",%s\n" % str(out_val))

I then ran the following Scikit script:

import sklearn
from sklearn import linear_model

import numpy as np

data = []
target = []
with open("data_yequalsx.csv") as inFile:
    for row in inFile:
        vals = row.split(",")
        data.append([float(x) for x in vals[:-1]])
        target.append(float(vals[-1]))

test_samples = len(data) // 10

train_data = data[test_samples:]
train_target = target[test_samples:]
test_data = data[:test_samples]
test_target = target[:test_samples]

model = linear_model.SGDRegressor(n_iter=100, learning_rate="invscaling", eta0=0.0001, power_t=0.5, penalty="l2", alpha=0.0001, loss="squared_loss")
model.fit(train_data, train_target)
print(model.coef_)
print(model.intercept_)

result = model.predict(test_data)
mse = np.mean((result - test_target) ** 2)
print("Mean Squared Error = %s" % str(mse))

And then this Spark script (run with spark-submit, no other arguments):

from pyspark.mllib.regression import LinearRegressionWithSGD, LabeledPoint
from pyspark import SparkContext

sc = SparkContext(appName="mllib_simple_accuracy")

# minPartitions doesn't guarantee that you get exactly that many partitions, just that you won't have fewer than that many
raw_data = sc.textFile("data_yequalsx.csv", minPartitions=10)
data = raw_data.map(lambda line: [float(x) for x in line.split(",")]).map(lambda entry: LabeledPoint(entry[-1], entry[:-1])).zipWithIndex()
test_samples = data.count() // 10

training_data = data.filter(lambda (entry, index): index >= test_samples).map(lambda (lp,index): lp)
test_data = data.filter(lambda (entry, index): index < test_samples).map(lambda (lp,index): lp)

model = LinearRegressionWithSGD.train(training_data, step=0.01, iterations=100, regType="l2", regParam=0.0001, intercept=True)
print(model.weights)
print(model.intercept)

mse = (test_data.map(lambda lp: (lp.label - model.predict(lp.features))**2 ).reduce(lambda x,y: x+y))/test_samples;
print("Mean Squared Error: %s" % str(mse))

sc.stop()

Strangely, though, the error given by Spark is an order of magnitude larger than that given by Scikit (0.185 and 0.045 respectively), despite the two models having a nearly identical setup (as far as I can tell). I understand that this is using SGD with very few iterations, so the results may differ, but I wouldn't have thought the difference would be anywhere near that large, or the error that large, especially given the exceptionally simple data.


Is there something I'm misunderstanding in Spark? Is it not correctly configured? Surely I should be getting a smaller error than that?

JacquesH
    I suggest you provide error bounds by repeating the experiment multiple times with different random seeds and checking whether you get the same result; 1000 data points and 100 iterations is not a lot. Furthermore, do sklearn and MLlib use the same learning rate schedule for SGD? You use invscaling for sklearn, but is MLlib using the same? – Peter Prettenhofer Jan 22 '15 at 21:08
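
A rough sketch of that suggestion, reusing the train/test variables from the scikit-learn script above (the number of repeats and the seed values are arbitrary choices, not part of the original post):

import numpy as np
from sklearn import linear_model

mses = []
for seed in range(10):
    # refit the same model with a different random seed each time
    m = linear_model.SGDRegressor(n_iter=100, learning_rate="invscaling", eta0=0.0001,
                                  power_t=0.5, penalty="l2", alpha=0.0001,
                                  loss="squared_loss", random_state=seed)
    m.fit(train_data, train_target)
    mses.append(np.mean((m.predict(test_data) - np.array(test_target)) ** 2))

print("MSE mean = %s, std = %s" % (np.mean(mses), np.std(mses)))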

2 Answers


SGD, which stands for Stochastic Gradient Descent, is an online convex optimization algorithm and is therefore very difficult to parallelize, since it makes one update per iteration (there are smarter variants, such as SGD with mini-batches, but these are still not well suited to a parallel environment).
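
For what it's worth, MLlib's SGD trainers do expose a mini-batch knob; a minimal sketch reusing the training_data RDD from the question (the 0.1 fraction is an arbitrary choice):

from pyspark.mllib.regression import LinearRegressionWithSGD

# each iteration samples roughly 10% of the RDD instead of the full dataset
# (the default miniBatchFraction=1.0 uses all of the data every iteration)
model = LinearRegressionWithSGD.train(training_data, iterations=100, step=0.01,
                                      miniBatchFraction=0.1,
                                      regType="l2", regParam=0.0001, intercept=True)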

On the other hand, batch algorithms such as L-BFGS, which I advise you to use with Spark (LogisticRegressionWithLBFGS), can be easily parallelized, since they perform one iteration per epoch (they need to see all the data points, calculate the value and gradient of the loss function at each point, and then perform an aggregation to compute the full gradient).
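
Note that LogisticRegressionWithLBFGS is a classifier, so it doesn't apply directly to the continuous labels in the question; as a sketch, for a hypothetical RDD of 0/1-labelled LabeledPoints it would be invoked roughly like this:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# labeled_points is a hypothetical RDD of LabeledPoint with binary (0/1) labels
model = LogisticRegressionWithLBFGS.train(labeled_points, iterations=100,
                                          regParam=0.0001, regType="l2",
                                          intercept=True)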

The Python (scikit-learn) version runs on a single machine, and therefore SGD performs well there.

By the way, if you look at the MLlib code, the equivalent of scikit-learn's lambda is lambda divided by the size of the dataset: MLlib optimizes 1/n * sum_i( l(f(x_i), y_i) ) + lambda * R(w), while scikit-learn optimizes sum_i( l(f(x_i), y_i) ) + lambda * R(w).
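
If that is right, matching the question's alpha would mean dividing it by the number of training points; a back-of-the-envelope sketch using the question's numbers (alpha=0.0001, roughly 900 training points after the 90/10 split):

sklearn_alpha = 0.0001              # the alpha passed to SGDRegressor in the question
n_train = 900                       # ~90% of the 1000 generated points
mllib_regParam = sklearn_alpha / n_train
print(mllib_regParam)               # ~1.1e-07, much smaller than the 0.0001 used above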

Oussama

Because Spark is parallelized, each node needs to be able to work independently of the other nodes while the computation is underway, to avoid [time-]expensive shuffles between the nodes. Consequently, it uses a procedure called Stochastic Gradient Descent to approach a minimum, which follows local gradients downwards.

The 'exact' way to solve a [simple, least-squares] regression problem involves solving a matrix equation. This is probably what Scikit-Learn is doing, so in this case it will be more accurate.
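
For reference, the 'exact' matrix solution for this dataset can be sketched in a few lines of NumPy (illustrative only, not a claim about what scikit-learn does internally):

import numpy as np

raw = np.loadtxt("data_yequalsx.csv", delimiter=",")
X = np.column_stack([raw[:, :-1], np.ones(len(raw))])  # append a column of ones for the intercept
y = raw[:, -1]

# solve the least-squares problem min ||Xw - y||^2 (normal equations: X^T X w = X^T y)
w = np.linalg.lstsq(X, y)[0]
print("coefficients:", w[:-1])
print("intercept:", w[-1])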

The trade-off is that solving a matrix equation generally scales as N^3 for a size-N square matrix, which rapidly becomes infeasible for large datasets. Spark trades some accuracy for computational scalability. As with any machine-learning procedure, you should build LOTS of sanity checks into your algorithms to make sure that the results of the previous step make sense.

Hope this helps!

StackG