
I'm running linear regression on COVID-19 data over all 3000+ US counties and the code is running pretty slow. Are there options to parallelize this?

I've tried furrr::future_map(), but it doesn't speed things up much. CPU usage stays around 26% with and without furrr::future_map(), and only one process is running.

Example code:

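library(dplyr)   # %>%, filter(), mutate(), arrange(), group_by()
library(tidyr)   # nest()
library(furrr)   # future_map(); also attaches future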
future::plan(multisession, workers = 6)
# also tried multisession workers = 6, (runtime 8.5 min)
# also tried multicore workers = 6, (runtime 3.5 min)
# also tried multicore w/ default workers, (runtime 5.5 min)

# the other called regression functions look very similar
casesmodel <- function(tbl) {
        lm(casesper100k ~ time, data = tbl)
}

uscases_twoweeks <-
    casesdeaths %>%
        filter(date >= twoweeksago) %>%
        filter(!is.na(population)) %>%
        filter(population > min_country_population) %>%
        mutate(countyid = paste(county, state, sep = ", ")) %>%
        arrange(countyid, date) %>%
        group_by(countyid) %>%
        nest() %>%
        mutate(deathmodel = future_map(data, deathsmodel),
               casemodel = future_map(data, casesmodel),
               absdeathmodel = future_map(data, absdeathsmodel),
               abscasemodel = future_map(data, abscasesmodel))
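
One possible explanation for the single busy process, offered as an assumption rather than a confirmed diagnosis: `group_by() %>% nest()` returns a tibble that is still grouped by `countyid`, so `mutate()` evaluates `future_map()` once per county with a one-element list each time, which leaves nothing to spread across workers. The furrr documentation describes this under its common gotchas for grouped data frames. A minimal sketch on hypothetical toy data (the `toy` table and `workers = 2` are made up for illustration), with `ungroup()` added before `mutate()`:

```r
library(dplyr)
library(tidyr)
library(furrr)

plan(multisession, workers = 2)

# Hypothetical stand-in for the real county data
toy <- tibble(
  countyid     = rep(c("A", "B", "C"), each = 5),
  time         = rep(1:5, times = 3),
  casesper100k = c(1:5, 2 * (1:5), 3 * (1:5))
)

casesmodel <- function(tbl) {
  lm(casesper100k ~ time, data = tbl)
}

nested <- toy %>%
  group_by(countyid) %>%
  nest() %>%
  ungroup() %>%   # drop the grouping so future_map() sees every row in one call
  mutate(casemodel = future_map(data, casesmodel))

# Pull the fitted slope out of each model
slopes <- vapply(nested$casemodel,
                 function(m) unname(coef(m)[["time"]]),
                 numeric(1))

plan(sequential)  # shut the background workers down again
```

Even with the grouping removed, each `lm()` fit is cheap, so the cost of shipping data to the workers may still outweigh the gain; timing both variants on the real data is the only way to know.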

HenrikB
Rob Hanssen
  • Building a linear model is very fast. If you want to parallelize this it might even get slower due to the overhead... maybe you can use `SparkR`? – danlooo Oct 04 '21 at 14:22
  • I'm not sure I read that tidyverse gibberish correctly, but it appears like you do a regression for each country? Try using `nlme::lmList` instead. – Roland Oct 04 '21 at 14:24
  • Why do you not want a combined model with county as a fixed or random factor? – danlooo Oct 04 '21 at 14:30
  • there are also some faster linear model implementations (`fastglm` package, `RcppEigen::fastLm` ...) – Ben Bolker Oct 04 '21 at 15:05
  • If you only need coefficients then if y is the dependent variable and x is the independent variable and g is a factor defining the groups then try flm (fast lm): `library(collapse); flm(y, model.matrix(~ g/x-1))` – G. Grothendieck Oct 04 '21 at 15:56
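
Two of the suggestions above can be sketched briefly. This is an illustration on hypothetical toy data, not a benchmark: `nlme::lmList()` fits one `lm()` per level of a grouping factor in a single call, and for a simple regression with one predictor the slope can also be computed directly as cov(x, y) / var(x) without building model objects at all. The `toy` data frame and its values are made up.

```r
library(nlme)  # recommended package, normally installed with R

# Hypothetical stand-in for the real county data
toy <- data.frame(
  countyid     = factor(rep(c("A", "B", "C"), each = 5)),
  time         = rep(1:5, times = 3),
  casesper100k = c(1, 2, 4, 4, 5,
                   3, 5, 6, 9, 11,
                   1, 4, 8, 10, 13)
)

# One regression per county in a single call
fits <- lmList(casesper100k ~ time | countyid, data = toy)
lmlist_coefs <- coef(fits)   # data frame: one row per county, columns (Intercept) and time

# Closed-form slope per county: cov(x, y) / var(x)
slope_by_group <- sapply(
  split(toy, toy$countyid),
  function(d) cov(d$time, d$casesper100k) / var(d$time)
)
```

The closed-form version only returns slopes, so it fits the case where the coefficients are all that is needed; standard errors or residuals still require a fitted model.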
