
I'm using Databricks with the SparkR package to build a glm model. Everything seems to run OK except when I run summary(lm1). Instead of getting the usual coefficient table with Variable, Estimate, Std. Error, t-value and p-value columns (which is what I'd expect to see), I just get the variable and estimate. The only explanation I can think of is that the data set is big enough (train1 is 12 million rows and test1 is 6 million rows) that all the estimates have 0 p-values. Are there any other reasons this would happen?

library(SparkR)

rdf <- sql("select * from myTable")  # read data
train1 <- rdf[rdf$ntile_3 != 1, ]    # split into train/test based on ntile in table
test1  <- rdf[rdf$ntile_3 == 1, ]

vtu1 <- c('var1', 'var2', 'var3')    # predictor variables

lm1 <- glm(target ~ ., train1[, c(vtu1, 'target')], family = 'gaussian')
pred1 <- predict(lm1, test1)

summary(lm1)


screechOwl
  • The function `summary()` usually calls stats::summary.glm(), but perhaps in your Databricks environment summary() dispatches to a different method. Do you still get the 'weird' result if you use `stats::summary.glm(lm1)`? What are the p-values using `summ <- stats::summary.glm(lm1); coef(summ)`? – jared_mamrot Apr 30 '21 at 07:01
  • @jared_mamrot: I get an error - `Error : $ operator not defined for this S4 class` – screechOwl Apr 30 '21 at 07:05
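That error comes from applying `$` (via `coef()`/`stats::summary.glm`) to the SparkR model, which is an S4 object, not a base-R glm fit. A sketch of an alternative, assuming a recent Spark version in which `summary()` on a SparkR GLM returns a result whose `$coefficients` matrix carries the Std. Error, t value and Pr(>|t|) columns (this needs a running SparkR session, e.g. on Databricks, so it is untested here):

```r
# lm1 is the SparkR glm model fitted above
summ <- summary(lm1)       # SparkR's own summary method, not stats::summary.glm

# The summary result (unlike the S4 model object itself) supports $ access;
# in recent Spark versions $coefficients holds Estimate, Std. Error,
# t value and Pr(>|t|)
print(summ$coefficients)
```

If `$coefficients` here still shows only estimates, that points to the Spark version (or solver) rather than the data size being the limitation.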

1 Answer


As you specify `family = 'gaussian'` in your model, your glm model is equivalent to a standard linear regression model (fitted by `lm` in base R). For an extensive discussion of interpreting glm output, see for example here: https://stats.stackexchange.com/questions/187100/interpreting-glm-model-output-assessing-quality-of-fit If you specify your model using `lm`, you should get the output you want.
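One way to follow this suggestion is to pull the data (or, given the 12 million rows, a sample of it) down to the driver and fit base-R `lm` there, since base R works on local data.frames rather than SparkDataFrames. A sketch, assuming a running SparkR session, the `train1` object from the question, and an illustrative 10% sampling fraction (untested here for the same reason):

```r
# Sample the SparkDataFrame and collect it to a local data.frame on the driver
local_train <- collect(sample(train1, withReplacement = FALSE, fraction = 0.1))

# Base-R lm on the local data gives the full coefficient table
fit <- lm(target ~ var1 + var2 + var3, data = local_train)
summary(fit)   # Estimate, Std. Error, t value and Pr(>|t|) columns
```

The trade-off is that the standard errors then reflect the sampled subset, not the full 12 million rows, but with a sample this large the qualitative picture of which p-values are near zero should be the same.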

Ane