5

I am having a problem using the predict() function for a mgcv::gam (training) model on a new (testing) dataset. The problem arises due to a mrf smooth I have integrated to account for the spatial nature of my data.

I use the following call to create my GAM model

## Run GAM with MRF
m <- gam(crime ~ s(district,k=nrow(traindata),
                 bs ='mrf',xt=list(nb=nbtrain)), #define MRF smooth
     data = traindata,
     method = 'REML', 
     family = scat(), #fit scaled t distribution
     gamma = 1.4
)

where I predict the dependent variable crime using the neighbourhood structure, parsed into the model in the smooth term argument xt. The neighbourhood structure comes as a nb object that I created using the poly2nb() function.

Now, if I want to use predict() on a new testing dataset, I don't know how to pass the according neighbourhood structure into the call. Providing just the new data

pred <- predict.gam(m,newdata=testdata)

throws the following error:

Error in predict.gam(m, newdata = testdata) :
7, 16, 20, 28, 35, 36, 37, 43 not in original fit

Here's a full reproduction of the error using the Columbus dataset called from within R directly:

#ERROR REPRODUCTION

## Load packages
require(mgcv)
require(spdep)
require(dplyr)

## Load Columbus Ohio crime data (see ?columbus for details and credits)
data(columb.polys) #Columbus district shapes list
columb.polys <- lapply(columb.polys,na.omit) #omit NAs (unfortunate problem with the Columbus sample data)
data(columb) #Columbus data frame

df <- data.frame(district=numeric(0),x=numeric(0),y= numeric(0)) #Create empty df to store x, y and IDs for each polygon

## Extract x and y coordinates from each polygon and assign district ID
for (i in 1:length(columb.polys)) {
  district <- i-1
  x <- columb.polys[[i]][,1]
  y <- columb.polys[[i]][,2]
  df <- rbind(df,cbind(district,x,y)) #Save in df data.frame
}

## Convert df into SpatialPolygons
sp <- df %>%
       group_by(district) %>%
       do(poly=select(., x, y) %>%Polygon()) %>%
       rowwise() %>%
       do(polys=Polygons(list(.$poly),.$district)) %>%
       {SpatialPolygons(.$polys)}

## Merge SpatialPolygons with data
spdf <- SpatialPolygonsDataFrame(sp,columb)

## Split into training and test sample (80/20 ratio)
splt <- sample(1:2,size=nrow(spdf),replace=TRUE,prob=c(0.8,0.2))
train <- spdf[splt==1,] 
test <- spdf[splt==2,]

## Prepapre both samples and create NB objects
traindata <- train@data #Extract data from SpatialPolygonsDataFrame
testdata <- test@data
traindata <- droplevels(as(train, 'data.frame')) #Drop levels
testdata <- droplevels(as(test, 'data.frame'))
traindata$district <- as.factor(traindata$district) #Factorize
testdata$district <- as.factor(testdata$district)
nbtrain <- poly2nb(train, row.names=train$Precinct, queen=FALSE) #Create NB objects for training and test sample
nbtest <- poly2nb(test, row.names=test$Precinct, queen=FALSE)
names(nbtrain) <- attr(nbtrain, "region.id") #Set region.id
names(nbtest) <- attr(nbtest, "region.id")

## Run GAM with MRF
m <- gam(crime ~ s(district, k=nrow(traindata), bs = 'mrf',xt = list(nb = nbtrain)), # define MRF smooth
         data = traindata,
         method = 'REML', # fast version of REML smoothness selection; alternatively 'GCV.Cp'
         family = scat(), #fit scaled t distribution
         gamma = 1.4
)

## Run prediction using new testing data
pred <- predict.gam(m,newdata=testdata)
  • The problem is the `district` variable, which is a factor. There are factors in the test data (i.e., `7, 16, 20, 28, 35, 36, 37, 43`) that don't appear in the training data and, consequently, no predictions can be made. If you omit those, you should be able to make predictions. – Dan Jan 31 '18 at 16:23
  • If I create the test sample as a subset of the training sample it does work; is that what you meant? Else, can you think of a workaround that would allow me to have independent training and testing samples? – Konstantin Klemmer Jan 31 '18 at 16:30
  • I'd just omit `district` from your model, as it's not needed. As far as I can see, it's just an ID tag. It's the other data that you're actually modelling. – Dan Jan 31 '18 at 16:37
  • Yes, the `district` is just an ID and not needed. But there's no real predictor variable. The model estimates the value of `crime` in each area as a function of the `crime` values in the neighbouring areas. Also, the model doesn't allow me to just drop `district` in the `s()` call unfortunately. (It throws a subscript out of bounds error) – Konstantin Klemmer Jan 31 '18 at 16:45
  • But surely the predictors should be location (i.e., `x`, `y`), home value, open space, etc.? – Dan Jan 31 '18 at 16:52
  • @Lyngbakr that's not how an MRF works; the spatial association is given via the neighbours of each node in the graph, and the Markov property says that two nodes are conditionally independent if they don't share a neighbour (or something like that). – Gavin Simpson Jan 31 '18 at 16:58
  • I see, sorry I should have clarified that: as soon as there will be an independent variable (house price,...) the prediction would work. However, I am trying to create a prediction that is solely based on the spatial process (i.e. the neighbourhood structure). – Konstantin Klemmer Jan 31 '18 at 17:01
  • 1
    I haven't looked into this much; From this [discussion](http://r.789695.n4.nabble.com/MRF-smoothers-in-MGCV-specifying-neighbours-td4690164.html) on R-Help from a few years back, Simon Wood is suggesting fitting the model with the full data but using a zero weight for the training observations. – Gavin Simpson Jan 31 '18 at 17:03
  • Further evidence that I shouldn't comment on things I don't understand... :) – Dan Jan 31 '18 at 17:04
  • @Lyngbakr `district` needs to be a factor for the constructor to work. I also don't see what difference it not being a factor would make; it is supposed to to be the area id (in this case; node ID in general) for each observation and you can have multiple observations for a node (think spatiotemporal data). – Gavin Simpson Jan 31 '18 at 17:05
  • When I said training observations should have a zero weight I mean the test observations should have a zero weight. Sorry! – Gavin Simpson Jan 31 '18 at 17:07
  • @GavinSimpson My point (before I realised that I didn't know what I was talking about) was that the test data had no overlap at all with the training data with respect to `district`, so predict had no idea what to do with it since it was given previously unknown factors. If it was a numeric (e.g., districts are set out linearly in space with 4 between 3 and 5) then at least the GAM could have a crack at a prediction, right? (And apologies for deleting the comment that you replied to!) – Dan Jan 31 '18 at 17:12
  • Thanks, I will look into that discussion. Any other suggestions for workarounds are appreciated! My final aim btw is to run a k-fold cross validation using random training and test samples so if someone can suggest a way to work that out without the `predict()` call it would also be of great help! – Konstantin Klemmer Jan 31 '18 at 17:14
  • @Lyngbakr right; unless you have all the neighbour info you can't predict, but that doesn't, I think, mean that you have to fit the model with all known regions. Regions missing from the fit can be put into the graph and their prediction will be based on the neighbours of the node. In other words, I think the math works fine as long as you know the neighbours of the new node. As implemented in `mgcv::gam()` this isn't possible from what I can tell. You might be able to wing it if you pass the penalty matrix yourself? – Gavin Simpson Jan 31 '18 at 17:19
  • The nodes are like random effects and unobserved levels are pulled to population values (or neighbour node values) – Gavin Simpson Jan 31 '18 at 17:20
  • I guess I will look into manually calculating the penalty matrix; but that seems rather complicated. Thanks for the tip though! – Konstantin Klemmer Jan 31 '18 at 17:37
  • 1
    @KonstantinKlemmer Note that I'm not even sure that [manually building the penalty matrix] works. Try adding a vector `ind` to the data indicating if a row is a train or test sample, then add a weight vector `wt <- ind / mean(ind)`, then pass `weights = wt` in your `gam()` model. This is normalising the weights so it doesn't change the log-likelihood of the model. – Gavin Simpson Jan 31 '18 at 18:52
  • Thanks! I will try to implement that and give feedback on this approach. – Konstantin Klemmer Jan 31 '18 at 19:13
  • 1
    @GavinSimpson First of all, thanks for the help! I have established your solution with the normalized weights and it throws a warning: `Warning messages: 1: In gam.fit4(x, y, sp, Eb, UrS = UrS, weights = weights,...:Non-finite coefficients at iteration 2 13: In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS,... : Fitting terminated with step failure - check results carefully`. The prediction then jumps to 100% explained variance. However, if I use the non-normalized weights `ind`, it works. Do we really need to normalize here? – Konstantin Klemmer Feb 01 '18 at 15:32
  • 1
    @KonstantinKlemmer Oops, right; I think you need un-normalized weights in this case as you want the locations with no data to have 0 weight. (I defaulted to normalised as I've been fitting GAMs with weights recently where without normalisation I was getting overly small standard errors.) Glad you have a version working for you! – Gavin Simpson Feb 01 '18 at 16:08
  • @GavinSimpson Great! Last question: Do you think this method for a robust cross validation to test for model overfitting? – Konstantin Klemmer Feb 01 '18 at 16:54

2 Answers2

3

SOLUTION:

I finally found the time to update this post with the solution. Thanks to everyone for helping me out. Here is the code for implementing k-fold CV with a random training-testing split:

#Apply k-fold cross validation
mses <- data.frame() #Create empty df to store CV squared error values
scores <- data.frame() #Create empty df to store CV R2 values
set.seed(42) #Set seed for reproducibility
k <- 10 #Define number of folds
for (i in 1:k) {
  # Create weighting column
  data$weight <- sample(c(0,1),size=nrow(data),replace=TRUE,prob=c(0.2,0.8)) #0 Indicates testing sample, 1 training sample

  #Run GAM with MRF
  ctrl <- gam.control(nthreads = 6) #Set controls
  m <- gam(crime ~ s(disctrict, k=nrow(data), bs = 'mrf',xt = list(nb = nb)), #define MRF smooth
            data = data,
            weights = data$weight, #Use only weight==1 observations (training)
            method = 'REML', 
            control = ctrl,
            family = scat(), 
            gamma = 1.4
           )
  #Generate test dataset
  testdata <- data[data$weight==0,] #Select test data by weight
  #Predict test data
  pred <- predict(m,newdata=testdata)
  #Extract MSES
  mses[i,1] <- mean((data$R_MeanDiff[data$weight==0] - pred)^2)
  scores[i,1] <- summary(m)$r.sq
}
av.mse.GMRF <- mean(mses$V1)
av.r2.GMRF <- mean(scores$V1)
1

I have one question criticism with the current solution, being that the full dataset was used to "train" the model meaning that the predictions are going to be biased since the testdata was used to train it.

This only requires a couple minor tweaks to fix:

#Apply k-fold cross validation
mses <- data.frame() #Create empty df to store CV squared error values
scores <- data.frame() #Create empty df to store CV R2 values
set.seed(42) #Set seed for reproducibility
k <- 10 #Define number of folds

#For loop for each fold
for (i in 1:k) {

  # Create weighting column
  data$weight <- sample(c(0,1),size=nrow(data),replace=TRUE,prob=c(0.2,0.8)) #0 Indicates testing sample, 1 training sample

  #Generate training dataset
  trainingdata <- data[data$weight == 1, ] #Select test data by weight  

  #Generate test dataset
  testdata <- data[data$weight == 0, ] #Select test data by weight


  #Run GAM with MRF
  ctrl <- gam.control(nthreads = 6) #Set controls
  m <- gam(crime ~ s(disctrict, k=nrow(data), bs = 'mrf',xt = list(nb = nb)), #define MRF smooth
            data    = trainingdata,
            weights = data$weight, #Use only weight==1 observations (training)
            method  = 'REML', 
            control = ctrl,
            family  = scat(), 
            gamma   = 1.4
           )

  #Predict test data
  pred <- predict(m,newdata = testdata)

  #Extract MSES
  mses[i,1] <- mean((data$R_MeanDiff[data$weight==0] - pred)^2)
  scores[i,1] <- summary(m)$r.sq
}

#Get average scores from each k-fold test
av.mse.GMRF <- mean(mses$V1)
av.r2.GMRF <- mean(scores$V1)

Adam Kemberling
  • 301
  • 1
  • 11