
I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than R. I have done some research, and the only post I found related to this issue is this. It seems that Spark Logistic Regression returns models that minimize the loss function while R's glm function uses maximum likelihood. The Spark model only got 71.3% of the records right, while R predicts 95.55% of the cases correctly. I was wondering if I did something wrong in the setup and if there's a way to improve the prediction. My Spark code and R code are below.

Spark code

partial model_input  
label,AGE,GENDER,Q1,Q2,Q3,Q4,Q5,DET_AGE_SQ  
1.0,39,0,0,1,0,0,1,31.55709342560551  
1.0,54,0,0,0,0,0,0,83.38062283737028  
0.0,51,0,1,1,1,0,0,35.61591695501733



def trainModel(df: DataFrame): PipelineModel = {  
  val lr = new LogisticRegression().setMaxIter(100000).setTol(1e-16)  
  val pipeline = new Pipeline().setStages(Array(lr))  
  pipeline.fit(df)  
}

val meta =  NominalAttribute.defaultAttr.withName("label").withValues(Array("a", "b")).toMetadata

val assembler = new VectorAssembler().
  setInputCols(Array("AGE", "GENDER", "DET_AGE_SQ",
    "Q1", "Q2", "Q3", "Q4", "Q5")).
  setOutputCol("features")

val model = trainModel(model_input)
val pred = model.transform(model_input)
pred.filter("label!=prediction").count

R code

lr <- model_input %>% glm(data=., formula=label ~ AGE+GENDER+Q1+Q2+Q3+Q4+Q5+DET_AGE_SQ,
                          family=binomial)
pred <- data.frame(y=model_input$label, p=fitted(lr))
table(pred$y, pred$p > 0.5)

Feel free to let me know if you need any other information. Thank you!

Edit 9/18/2015: I have tried increasing the maximum number of iterations and decreasing the tolerance dramatically. Unfortunately, it didn't improve the prediction. It seems the model converged to a local minimum instead of the global minimum.

  • [This](http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg) may be relevant, because Spark uses similar algorithms to sklearn. It may be worth trying to normalize your data before running LR. You can also try [LBFGS instead of SGD](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS), but in that case you have to use MLlib instead of ML. – max Jun 26 '16 at 19:46
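
Following up on the normalization suggestion in the comment above, here is a minimal sketch (my own illustration, not a verified fix for this data) of inserting a StandardScaler stage between the assembler and the logistic regression; column names mirror the question, and "rawFeatures" is just an illustrative name:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Sketch: assemble -> scale -> fit, so every feature is on a comparable scale.
val assembler = new VectorAssembler()
  .setInputCols(Array("AGE", "GENDER", "DET_AGE_SQ", "Q1", "Q2", "Q3", "Q4", "Q5"))
  .setOutputCol("rawFeatures")

val scaler = new StandardScaler()
  .setInputCol("rawFeatures")
  .setOutputCol("features")   // LogisticRegression reads "features" by default
  .setWithStd(true)

val lr = new LogisticRegression().setMaxIter(100)

val pipeline = new Pipeline().setStages(Array(assembler, scaler, lr))
val scaledModel = pipeline.fit(model_input)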

1 Answer


> It seems that Spark Logistic Regression returns models that minimize the loss function while R's glm function uses maximum likelihood.

Minimizing a loss function is pretty much the definition of a linear model, and both glm and ml.classification.LogisticRegression are no different here. The fundamental difference between the two is the way this minimization is achieved.
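
To make that concrete (standard logistic-regression notation; this is my own summary, not taken from either library's documentation), both are fitting an objective of the form

$$\hat{\beta} = \arg\min_{\beta}\ \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i^{\top}\beta}\right), \qquad y_i \in \{-1, +1\},$$

which is exactly the negative log-likelihood of the logistic model, so (before any regularization is added) minimizing this loss and maximizing the likelihood select the same coefficients.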

All linear models in ML/MLlib are based on some variant of gradient descent. The quality of a model generated with this approach varies on a case-by-case basis and depends on the gradient descent and regularization parameters.
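
For illustration, a minimal sketch of the knobs on ml.classification.LogisticRegression that control this; the values below are placeholders I picked for the example, not recommendations:

import org.apache.spark.ml.classification.LogisticRegression

// Sketch: the main parameters affecting how close the solver gets to the optimum.
// The values are illustrative placeholders, not tuned settings.
val tunedLr = new LogisticRegression()
  .setMaxIter(1000)         // more iterations -> closer to convergence
  .setTol(1e-8)             // tighter convergence tolerance
  .setRegParam(0.0)         // 0.0 disables regularization (closest to a plain glm fit)
  .setElasticNetParam(0.0)  // 0.0 = L2, 1.0 = L1 (only relevant when regParam > 0)
  .setFitIntercept(true)    // fit an intercept term, as glm does by default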

R, on the other hand, computes an exact solution which, given its time complexity, is not well suited for large datasets.

As I mentioned above, the quality of a model generated using gradient descent depends on the input parameters, so the typical way to improve it is to perform hyperparameter optimization. Unfortunately, the ML version is rather limited here compared to MLlib, but for starters you can increase the number of iterations.
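
As a rough sketch of what that can look like with the ml.tuning API (the grid values here are arbitrary examples, and df is assumed to already carry the label and features columns, as in the question):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Sketch: cross-validated grid search over regularization and iteration count.
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.01, 0.1))
  .addGrid(lr.maxIter, Array(100, 1000))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator()) // areaUnderROC by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(df) // best model available as cvModel.bestModel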

  • I haven't tried this yet. But do you think it would be possible to apply RFormula with Spark 1.5 to improve the model quality? – eliasah Sep 18 '15 at 07:35
  • I don't think so. As far as I understand, it is using MLlib under the hood. Still, since the logistic regression loss function is convex, adjusting the parameters should be more than enough. – zero323 Sep 18 '15 at 07:46
  • OK, sounds logical to me! It was just an idea that popped into my head. – eliasah Sep 18 '15 at 08:06