predict values from GAM for grouped dataframe in R

Question

I have a dataset of mean annual temperature values at different latitudes in different years. I want to use this to predict the latitude at which a given temperature could be found in a given year; i.e., "in 1980, at what latitude would the mean annual temperature have been 20C?"

I need to use year-specific models for this, because the relationship between latitude and temperature has changed over time (although not in the sample data below, which is randomly generated). This will involve:

Fit GAMs to a dataset grouped/split by year.
For each different GAM (that is, for each year), use predict.gam to calculate a predicted value for every element in a list of temperatures.
Recombine these to get a dataframe with columns representing year, newdata_value (the temperature value used for prediction), and predicted_value (latitude from feeding each newdata_value into the year-specific GAM).

Here's a toy dataset:

years <- seq(1968, 2018, 1)
lat <- seq(34.5, 44.5, 1)
dat <- expand.grid(years, lat)
names(dat) <- c("years","lat")
dat$temp <- runif(dim(dat)[1], 5, 20) # add random temperature data points 
newdata_values <- seq(2, 16, 2) # temperature values to use for prediction

I've tried various purrr and split-apply-combine solutions and haven't figured anything out. Any suggestions?

score 6 · Accepted Answer · edited Oct 30 '19 at 16:20

Another option is to fit a model that allows the lat/temp relationship to vary by year. There are several options for this. The following fits a model where each year has an independent relationship:

gam(lat ~ year + s(temp, by = year), data = dat)

Note that for this formulation year should be encoded as a factor.

An alternative would be to allow the lat/temp relationship to vary smoothly by year, a reasonable model if this relationship gradually changes over time. In this case you will want to use a tensor product smooth (te()) to indicate a two-way interaction between variables that are on different scales (degrees, years):

gam(lat ~ te(temp, year), data = dat)

In both cases you can then make a prediction with predict.gam(model, newdata = new_dat), where new_dat has both year and temp columns.

score 2 · Answer 2 · answered Oct 23 '19 at 21:26

One approach would be to use nested dataframes. I used code found in this tutorial.

You can group by year and use nest. I'll also rename columns and add the new values to predict:

library(tidyverse); library(mgcv)
names(dat) <- c('year', 'lat', 'temp')
dat2 <- dat %>% group_by(year) %>% nest()

dat2 <- dat2 %>% mutate(newdata_value = rep(list(newdata_values), n_distinct(year)))

You then define some helper functions to made the tidyverse code cleaner (I'm assuming your using gam from the mgcv package). Then map the model function to the data and map the predict function to the fitted models:

lat_gam <- function(df) {
  gam(lat ~ s(temp), data = df)
}

pred_gam <- function(mod) {
  predict.gam(mod, newdata = data.frame(temp = newdata_values))
}

dat2 <- dat2 %>% mutate(model = map(data, lat_gam))

dat2 <- dat2 %>% mutate(predicted_value = map(model, pred_gam))


dat2 %>% select(-data, -model) %>% unnest(cols = c(newdata_value, predicted_value))

The last line is totally optional, just gets the final output to print like the way you specified in 3)

This is really neat. Is it possible to add multiple arguments to the functions so that the column names of `dat` don't have to be the same every time, and can just be specified in `lat_gam` and `pred_gam`? I tried adding them to the functions and then to the `mutate(newcol = map())` calls, but I don't think `map` works with multiple inputs. — AFH, Oct 24 '19 at 13:27

score 1 · Answer 3 · answered Oct 30 '19 at 10:36

1

Here's a data.table approach:

library(data.table)
library(mgcv)

setDT(dat)

dat[, .(pred = c(predict.gam(gam(lat ~ temp), list(temp = newdata_values))),
        newdata_values),
    by = years]

The only problem I had was that the predict.gam(...) call returns an array. The c(predict.gam(...)) converts it to an array.

A similar base approach that does not have perfect formatting:

by(dat[, -1],
   dat[, 1],
   function(DF) {
     mod = gam(lat ~ temp, data = DF)
     pred = predict.gam(mod, list(temp = newdata_values))

     data.frame(newdata_values, pred)
     }
   )

answered Oct 30 '19 at 10:36

Cole

11,130
1
9
24

I really like the `data.table` approach but when I run that script I get a dataframe that I think just has the original data: `> head(dat) years lat temp 1: 1968 34.5 18.730418 2: 1969 34.5 14.890535 3: 1970 34.5 14.475564 4: 1971 34.5 7.380579 5: 1972 34.5 12.090908 6: 1973 34.5 18.565517` – AFH Oct 30 '19 at 14:20
@AFH that's because the ```data.table``` call above doesn't modify the original ```dat```. You would have to do a new assignment, i.e., ```mod_res <- dat[, .(pred = ...), by = years]``` – Cole Oct 31 '19 at 00:02

predict values from GAM for grouped dataframe in R

3 Answers3