I have a data frame like the one below and want to run the regression expressed by the formula below on it.
using GLM, StatsModels, Tables, DataFrames
training = DataFrame(yy = [1,2,3,7,5,4,2,3], continuous = [5,5,6,6,7,8,8,9], categorical = [:a,:a,:b,:b, :a,:c,:c,:b], bool = [true, false, true, true, true, false, false, true])
f = Term(:yy)~Term(:continuous) + Term(:categorical) + Term(:bool)
I can do this with the code below. (I know that for a simple linear model I can pass the formula and data frame directly to a fit function. Please ignore that: the regression model I am actually fitting has no such interface, so I need to build the X matrix myself, which is mm.m below.)
cols = Tables.columntable(training)
mf = StatsModels.ModelFrame(f, cols, model=GLM.LinearModel)
mm = StatsModels.ModelMatrix(mf)
fitted = fit(GLM.LinearModel, mm.m, response(mf))
The issue is that later on I need to use this fitted model to predict on a test set that might contain different categorical values. For instance, the test data frame below has no rows with the categorical level :b, so building its matrix with StatsModels.ModelMatrix as above would produce a matrix whose columns do not line up with the original.
test = DataFrame(continuous = [5,5,6,6,7,8,8,9], categorical = [:a,:a,:a,:a, :a,:c,:c,:a], bool = [false, false, false, false, false, false, false, false])
Is there a way to:
- Get the variable names for the columns of the StatsModels.ModelMatrix.m matrix (mm.m above)
- Reproduce the same mapping from data frame to matrix for a data frame that might have different categorical values than the data frame/matrix on which the model was trained.
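For reference, here is a minimal sketch of one approach that may cover both points. It assumes a StatsModels version (0.6 or later) in which the ModelFrame keeps the schema-applied formula in its `f` field, so that `coefnames(mf)` gives the column names and `modelcols` can reuse the training-time categorical levels on new data. The names `names_X`, `Xtest`, and `predictions` are illustrative only.

```julia
using GLM, StatsModels, Tables, DataFrames

training = DataFrame(yy = [1,2,3,7,5,4,2,3],
                     continuous = [5,5,6,6,7,8,8,9],
                     categorical = [:a,:a,:b,:b,:a,:c,:c,:b],
                     bool = [true, false, true, true, true, false, false, true])
test = DataFrame(continuous = [5,5,6,6,7,8,8,9],
                 categorical = [:a,:a,:a,:a,:a,:c,:c,:a],
                 bool = [false, false, false, false, false, false, false, false])

f = Term(:yy) ~ Term(:continuous) + Term(:categorical) + Term(:bool)

# Same construction as in the question.
mf = StatsModels.ModelFrame(f, Tables.columntable(training), model = GLM.LinearModel)
mm = StatsModels.ModelMatrix(mf)
fitted = fit(GLM.LinearModel, mm.m, response(mf))

# Names of the columns of mm.m, in column order.
names_X = coefnames(mf)

# mf.f is the formula with the training schema already applied, so its
# right-hand side remembers the categorical levels seen during training.
# Only the right-hand side is used because `test` has no yy column.
Xtest = modelcols(mf.f.rhs, Tables.columntable(test))

# Xtest has the same column layout as mm.m, so it can be passed to predict.
predictions = predict(fitted, Xtest)
```

This only handles test data whose levels are a subset of the training levels. A level that never appeared in training has no column to map to, and I would expect `modelcols` to throw an error in that case rather than silently change the layout.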