1

I have a numeric, a count, and an over-dispersed count large matrices:

set.seed(1)
numeric.mat <- matrix(rnorm(10000*6000),10000,6000)
count.mat <- matrix(rpois(10000*6000,10),10000,6000)
dispersed.count.mat <- matrix(rnegbin(10000*6000,10,2),10000,6000)

And one corresponding factors data.frame (can be a matrix too):

factors.df <- data.frame(f1 = sample(LETTERS[1:3], 10000, replace = T), 
                         f2 = sample(LETTERS[4:5], 10000, replace = T))

The number of factors is pretty small (in this case only 2 but won't be more than 5 for real data), and the number of levels in each (they're all categorical) is also small (also up to 5).

I'd like to obtain the residuals for fitting a linear, poisson, and negative binomial regression models to each of the columns in each of the matrices, respectively.

So for a single column:

data.df <- factors.df %>% 
    dplyr::mutate(numeric.y = numeric.mat[,1], 
                  count.y = count.mat[,1], 
                  dispersed.count.y = dispersed.count.mat[,1])

I'd use:

lm(numeric.y ~ f1+f2, data = data.df)$residuals
residuals(object = glm(count.y ~ f1+f2, data = data.df, family = "poisson"), type = 'pearson')
residuals(object = glm.nb(formula = model.formula, data = regression.df), type = 'pearson')

For the three regression models.

Is there a faster way of obtaining these residuals other than, for example, using do.call, for each. E.g.:

do.call(cbind, 
        lapply(1:ncol(numeric.mat), 
               function(i)
                   lm(numeric.y ~ f1+f2, 
                      data = dplyr::mutate(factors.df, 
                                           numeric.y = numeric.mat[,i])
                   )$residuals
))
alistaire
  • 42,459
  • 4
  • 77
  • 117
dan
  • 6,048
  • 10
  • 57
  • 125

1 Answers1

3

I'd slightly readjust how the workflow runs and allow it to be easily run in parallel.

# Use variables to adjust models, makes it easier to change sizes
iter <- 60
iter_samps <- 1000

factors_df <- data.frame(f1 = sample(LETTERS[1:3], iter_samps, replace = T), 
                         f2 = sample(LETTERS[4:5], iter_samps, replace = T)) 

# using a data.frame in a longer format to hold the data, allows easier splitting
data_df <- rep(list(factors_df), iter) %>% 
  bind_rows(.id = "id") %>%
  mutate(numeric_y = rnorm(iter_samps * iter),
         count_y = rpois(iter_samps * iter, 10),
         dispersed_count_y = MASS::rnegbin(iter_samps * iter, 10, 2))

# creating function that determines residuals
model_residuals <- function(data) {
  data$lm_resid <- lm(numeric_y ~ f1+f2, data = data)$residuals
  data$glm_resid <- residuals(object = glm(count_y ~ f1+f2, data = data, family = "poisson"), type = 'pearson')
  return(data)
}
# How to run the models not in parallel 
data_df %>%
  split(.$id) %>%
  map(model_residuals) %>%
  bind_rows()

To run the models in parallel you can use multidplyr to do all the annoying work

library("multidplyr")
test = data_df %>%
  partition(id) %>%
  cluster_library("tidyverse") %>%
  cluster_library("MASS") %>%
  cluster_assign_value("model_residuals", model_residuals) %>%
  do(results = model_residuals(.)) %>%
  collect() %>%
  .$results %>%
  bind_rows()
zacdav
  • 4,603
  • 2
  • 16
  • 37
  • Thanks a lot @zacdav. Suppose I wanted to call model_residuals with the model's formula and regression type as arguments in addition to data. i.e.: model_residuals(data, model.formula, regression.type)), to make it generic wrt model formula and run only the selected regression type (e.g.., only run poisson regression). Can you add the syntax for that? I've tried adding these args after the "." in the parallel implementation (do(results = model_residuals(df=., model.formula = as.formula("y~f1+f2"),regression.type="poisson"))) and am getting an error that it doesn't recognize the args. – dan Nov 08 '17 at 07:20