I have a data frame like the one below and want to run the regression given by the formula below on it.

```julia
using GLM, StatsModels, Tables, DataFrames
training = DataFrame(
    yy = [1, 2, 3, 7, 5, 4, 2, 3],
    continuous = [5, 5, 6, 6, 7, 8, 8, 9],
    categorical = [:a, :a, :b, :b, :a, :c, :c, :b],
    bool = [true, false, true, true, true, false, false, true],
)
f = Term(:yy) ~ Term(:continuous) + Term(:categorical) + Term(:bool)
```

I can do this with the code below. (For a simple linear model I know I can pass the formula and data frame directly to a fit function. Please ignore that: the regression model I am actually fitting has no such interface, so I need to build the X matrix myself, which is `mm.m` below.)

```julia
cols = Tables.columntable(training)
mf = StatsModels.ModelFrame(f, cols, model=GLM.LinearModel)
mm = StatsModels.ModelMatrix(mf)
fitted = fit(GLM.LinearModel, mm.m, response(mf))
```

The issue is that later on I need to use this fitted model to predict on a test set whose categorical column might contain a different set of values. For instance, the test data frame below never contains the categorical level `:b`, so building a `StatsModels.ModelMatrix` from it as above would produce a matrix whose columns don't line up with the training matrix.

```julia
test = DataFrame(
    continuous = [5, 5, 6, 6, 7, 8, 8, 9],
    categorical = [:a, :a, :a, :a, :a, :c, :c, :a],
    bool = [false, false, false, false, false, false, false, false],
)
```

Is there a way to:

  1. Get the variable names for the StatsModels.ModelMatrix.m matrix (mm.m above)
  2. Reproduce the same mapping from dataframe to matrix for a dataframe that might have different categorical values than the dataframe/matrix on which the model was trained.
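
For reference, here is a minimal sketch of one way both points might be approached. It assumes a StatsModels version that provides `schema`, `apply_schema`, `modelcols` and `coefnames` (v0.6 or later, as far as I know), and it builds the matrix from an applied formula rather than from `ModelMatrix`. The names `sch`, `f_applied`, `xnames` and `X_test` are just illustrative. The idea is that the schema is learned from the training data only, so the same applied formula can then be evaluated on any other table.

```julia
using DataFrames, GLM, StatsModels

training = DataFrame(
    yy = [1, 2, 3, 7, 5, 4, 2, 3],
    continuous = [5, 5, 6, 6, 7, 8, 8, 9],
    categorical = [:a, :a, :b, :b, :a, :c, :c, :b],
    bool = [true, false, true, true, true, false, false, true],
)
test = DataFrame(
    continuous = [5, 5, 6, 6, 7, 8, 8, 9],
    categorical = [:a, :a, :a, :a, :a, :c, :c, :a],
    bool = fill(false, 8),
)
f = Term(:yy) ~ Term(:continuous) + Term(:categorical) + Term(:bool)

# Learn column types and categorical levels from the training data only.
sch = schema(f, training)
# Bind the formula to that schema; the model type decides whether an
# implicit intercept column is added.
f_applied = apply_schema(f, sch, GLM.LinearModel)

# Response vector and design matrix for the training data.
y, X = modelcols(f_applied, training)
# Names of the columns of X, in column order (point 1).
xnames = coefnames(f_applied.rhs)

fitted = fit(GLM.LinearModel, X, y)

# Evaluate the training-time right-hand side on the test data (point 2).
# The categorical levels come from the schema, not from `test`, so a level
# that is absent from `test` still gets its (all-zero) column.
X_test = modelcols(f_applied.rhs, test)
predictions = predict(fitted, X_test)
```

This only covers test data whose levels are a subset of the training levels. I would expect a level that never appeared in training to raise an error in `modelcols`, since there is no column for it to map to.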
Stuart
  • I'm not sure I understand what your constraints are, but if you just have `mm.m` then the answer to 1 is straightforwardly "no" - it's just a `Matrix{Float64}` so there's no information about its meaning stored anywhere. I guess 2 might be possible with a bit of elbow grease and looking at the StatsModels pipeline for constructing the X matrix, but that's probably a Discourse discussion... – Nils Gudat Aug 29 '23 at 08:24
  • Thanks @NilsGudat. Yeah, I agree it is not possible in the current framework. I ended up writing everything myself to map from a formula to a matrix in a reproducible way for my case, which was not too painful (what I did is pretty bespoke to my use case, though, so I don't think it is worth sharing). I will leave the question open in case this becomes possible in a future version of StatsModels. – Stuart Aug 29 '23 at 19:23

0 Answers