
I want to know the significance of each coefficient of a logistic regression model fitted with the sparklyr function ml_logistic_regression. The code is as follows:

# data in R
library(MASS)
data(birthwt) 
str(birthwt)
detach("package:MASS", unload=TRUE)

# Connection to Spark
library(sparklyr)
library(dplyr)
sc = spark_connect(master = "local")

# copy the data to Spark
birth_sc = copy_to(sc, birthwt, "birth_sc", overwrite = TRUE)

# Model
# create dummy variables for race (race_1, race_2, race_3)
birth_sc = ml_create_dummy_variables(birth_sc, "race")
model = ml_logistic_regression(birth_sc, low ~ lwt + race_2 + race_3)

The model I get is the following:

> model
Call: low ~ lwt + race_2 + race_3

Coefficients:
(Intercept)         lwt      race_2      race_3 
 0.80575496 -0.01522311  1.08106617  0.48060322 

For a model fitted in base R (e.g. with glm), summary gives you the significance of the coefficients, but if I call summary on this model I just get the same output as printing it:

> summary(model)
Call: ml_logistic_regression(birth_sc, low ~ lwt + race_2 + race_3)

Coefficients:
  (Intercept)         lwt      race_2      race_3 
0.80575496 -0.01522311  1.08106617  0.48060322 

How can I get the significance of each variable in the model?

    Looking at the structure of the model object (`str(model)`), it doesn't look like `ml_logistic_regression` returns any information related to significance levels, confidence intervals or the variance-covariance matrix so I'm not sure it's possible. On the other hand, in machine learning, where you're usually trying to maximize predictive performance, statistical significance generally isn't important. Instead, use cross-validation to select the best model based on some performance criterion (like area under the ROC curve). – eipi10 Nov 20 '17 at 22:07
  • @eipi10 I tried calculating the variance-covariance matrix by hand in Spark, but I get to the point I need to calculate an inverse of a matrix, but I don't know how to calculate it in Spark – Joe Nov 21 '17 at 12:08
  • `solve(m)` will return the inverse of a matrix `m`. – eipi10 Nov 21 '17 at 23:01
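Following the suggestion in the comments, the Wald statistics can be computed by hand. A minimal sketch in plain R, using the same birthwt data locally (names race_2 and race_3 mirror the dummy columns from the question; glm is used here only to produce the design matrix and fitted values, so the hand-computed covariance can be checked against summary(fit)):

```r
library(MASS)
df <- birthwt
df$race_2 <- as.numeric(df$race == 2)
df$race_3 <- as.numeric(df$race == 3)

fit <- glm(low ~ lwt + race_2 + race_3, family = binomial, data = df)

X <- model.matrix(fit)                  # design matrix
b <- coef(fit)                          # fitted coefficients
p <- fitted(fit)                        # fitted probabilities

# Variance-covariance matrix by hand: solve(t(X) %*% W %*% X),
# where W is diagonal with entries p * (1 - p)
V  <- solve(t(X) %*% ((p * (1 - p)) * X))
se <- sqrt(diag(V))                     # standard errors
z  <- b / se                            # Wald z statistics
pv <- 2 * pnorm(-abs(z))                # two-sided p-values

cbind(Estimate = b, `Std. Error` = se, `z value` = z, `Pr(>|z|)` = pv)
```

The same linear algebra applies to coefficients estimated in Spark: collect the feature columns back into R, build X there, and plug in the Spark coefficients for b.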

2 Answers


You just can't. None of Spark's LogisticRegressionSummary variants provide significance statistics for the coefficients, so (as eipi10 pointed out in the comments) sparklyr cannot expose them either.


You might be able to get what you're looking for by using the generalized linear model with family = "binomial", which is equivalent to logistic regression. See http://spark.rstudio.com/reference/ml_generalized_linear_regression/ and the Spark reference for more information: https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#generalized-linear-regression
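A sketch of what this would look like with the birth_sc table and dummy columns from the question (this assumes an active Spark connection; the exact columns reported by the summary can vary between sparklyr and Spark versions):

```r
# Fit the same model via Spark's GLM interface, which (unlike
# ml_logistic_regression) carries standard errors and p-values
glm_model <- ml_generalized_linear_regression(
  birth_sc,
  low ~ lwt + race_2 + race_3,
  family = "binomial"          # logit link is the binomial default
)

# The summary reports Std. Error, t value and Pr(>|t|) per coefficient
summary(glm_model)
```

The coefficient estimates should match ml_logistic_regression, since both fit the same binomial/logit model; only the reported diagnostics differ.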

  • seems promising, but can you confirm that this object (unlike the logistic regression object) does contain information that will recover the p-values? Although https://stackoverflow.com/questions/48482245/calculating-standard-error-of-coefficients-for-logistic-regression-in-spark?rq=1 also suggests it does ... – Ben Bolker Aug 24 '18 at 03:56
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/20672231) – taras Aug 24 '18 at 05:57