
I'm working with SparkR, and I need to know how to predict new values and how accurate those predictions are.

This is the input, sample of data.csv

Classes 'data.table' and 'data.frame':  100 obs. of  8 variables:
 $ LINESET     : chr  "DG1000420" "DG1000420" "DG1000420" "DG1000420" ...
 $ TIMEINTERVAL: int  1383378600 1383394800 1383556800 1383679200 1383695400 1383718800 1383857400 1383873600 1383996000 1384269600 ...
 $ SmsIn       : num  77.4 76.6 99.2 63.7 10.7 ...
 $ SmsOut      : num  47.74 48.56 26.08 62.39 9.43 ...
 $ CallIn      : num  19.602 31.509 38.003 23.206 0.707 ...
 $ CallOut     : num  22.93 34.97 71.64 37.23 1.61 ...
 $ Internet    : num  435 502 363 465 295 ...
 $ ValueAmp    : num  39.8 32.9 81.4 94.3 54.2 ...

My model is

glm(ValueAmp ~ SmsIn + SmsOut + CallIn + CallOut + Internet + TIMEINTERVAL, data = Consumi, family = "gaussian")

I would like to know what the predicted values of ValueAmp are and how accurate they are.

I tried something like the following, as the Databricks documentation suggests, but I don't think it's what I'm looking for: the errors I get range from roughly -30 to +40, which doesn't seem very accurate.

training <- createDataFrame(sqlContext, Consumi)
model <- glm(ValueAmp ~ SmsIn + SmsOut + CallIn + CallOut + Internet,
             family = "gaussian", data = training)
summary(model)
preds <- predict(model, training)
errors <- select(
    preds, preds$label, preds$prediction, preds$LINESET,
    alias(preds$label - preds$prediction, "error"))
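
For reference, a minimal sketch for collapsing that error column into a single RMSE figure, assuming the same SparkR 1.x API as above (the data here is small enough to collect back to the driver):

local_errors <- collect(errors)          # bring the error column back as a local data.frame
rmse <- sqrt(mean(local_errors$error^2)) # root mean squared error of the predictions
rmse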

So is there a way in R or SparkR (preferably SparkR) to estimate new values with good accuracy?


1 Answer


First of all, you have to understand the difference between Spark(R) linear models and local linear models provided by tools like R. In general it is the difference between an approximation (usually achieved using some variant of Gradient Descent) and an exact analytical solution. While the latter guarantees an optimal solution, it is usually too expensive to use on large datasets. The former scales very well but provides only weak guarantees and can be highly dependent on input parameters.
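
To see that difference concretely, you can fit the same model locally with base R's glm (an exact fit) on the original data and compare it with the SparkR fit. This is only a sketch, assuming `Consumi` (the local data.frame) and `model` (the SparkR fit) from the question, and that the data still fits in local memory:

# Exact local fit with base R for comparison
local_fit <- stats::glm(ValueAmp ~ SmsIn + SmsOut + CallIn + CallOut + Internet,
                        family = gaussian(), data = Consumi)
coef(local_fit)   # exact solution from base R
summary(model)    # coefficients from the SparkR approximation; these may differ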

In general, when you use Gradient Descent you have to adjust model parameters. In the case of SparkR and linear regression (Gaussian model) these are (see the sketch after this list):

  • alpha - elastic-net mixing parameter
  • lambda - regularization parameter
  • solver - exact algorithm which is used to train the model
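
For example, all three can be passed directly to glm in SparkR. This is a sketch assuming the SparkR 1.6 glm signature; the exact argument names and defaults may differ between versions:

model_tuned <- glm(ValueAmp ~ SmsIn + SmsOut + CallIn + CallOut + Internet,
                   family = "gaussian", data = training,
                   alpha = 0.5,      # elastic-net mixing (0 = ridge, 1 = lasso)
                   lambda = 0.01,    # regularization strength
                   solver = "auto")  # let Spark choose the optimization algorithm
summary(model_tuned)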

After you choose the solver, the remaining parameters have to be tuned, usually using some variant of hyperparameter optimization. Unfortunately, there is no universal method, and a lot depends on the specific dataset.
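
As a rough illustration, a simple grid search could look like the sketch below. The alpha/lambda values are arbitrary, and a proper evaluation would score each model on a held-out split or via cross-validation rather than on the training data:

# Naive grid search over alpha/lambda, scored by in-sample RMSE
grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(0, 0.01, 0.1))

scores <- lapply(seq_len(nrow(grid)), function(i) {
  m <- glm(ValueAmp ~ SmsIn + SmsOut + CallIn + CallOut + Internet,
           family = "gaussian", data = training,
           alpha = grid$alpha[i], lambda = grid$lambda[i])
  p <- predict(m, training)   # in-sample only; use a held-out split in practice
  err <- collect(select(p, alias(p$label - p$prediction, "error")))$error
  data.frame(alpha = grid$alpha[i], lambda = grid$lambda[i],
             rmse = sqrt(mean(err^2)))
})
do.call(rbind, scores)   # pick the alpha/lambda pair with the lowest RMSE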

See also:
