I am running a parallel computation using furrr in R. The computation requires access to a web API, so authentication has to take place. When I run in parallel, each process needs to authenticate. In the example below I have 6 processes, so I would need to authenticate on these six processes first and then run the calculations. I don't know how to do that using furrr, so I end up authenticating in every run of the function, which is really inefficient.

Below is a simple example for illustrative purposes. It does not work because I can't share the api.configure function, but hopefully you get the idea.

Thanks

library(tidyverse)
library(furrr)
# note: multiprocess is deprecated in newer releases of the future package
plan(multiprocess, workers = 6)

testdf = starwars %>%
  select(-films, -vehicles, -starships) %>%
  future_pmap_dfr(.f = function(...) {
    # authentication is repeated for every row, which is the inefficiency
    api.configure(username = "username", password = "password")
    currentrow = tibble(...)
    l = tibble(name = currentrow$name, height = currentrow$height)
    return(l)
  })
Courvoisier
  • Could the API connection be kept as [global](https://rdrr.io/cran/furrr/man/future_options.html)? – Waldi Sep 29 '20 at 09:06
  • I don't know. is there something I can do for that or is this an API specific thing? – Courvoisier Sep 29 '20 at 09:17
  • if api.configure returns a connection object, you could pass this connection object as global variable. – Waldi Sep 29 '20 at 09:23
  • The api does not return an explicit connection object. it uses env to store token and other parameters. So I am now looking if I can use the api::api.env object in global – Courvoisier Sep 29 '20 at 09:34
  • I did `future_options(globals = "api:::api.env")` but that failed giving: `Error in api.authenticate() : Missing required parameter: username` – Courvoisier Sep 29 '20 at 09:37
  • See if my answer works : if not I'll delete it – Waldi Sep 29 '20 at 09:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222229/discussion-between-courvoisier-and-waldi). – Courvoisier Sep 29 '20 at 10:29

2 Answers


Try opening the connection before the map:

library(tidyverse)
library(furrr)
plan(multiprocess, workers = 6)

# authenticate once in the main session, before the map
api.configure(username = "username", password = "password")
ls(all.names = TRUE) # check whether new variables holding the connection are available

testdf = starwars %>%
  select(-films, -vehicles, -starships) %>%
  future_pmap_dfr(.f = function(...) {
    currentrow = tibble(...)
    l = tibble(name = currentrow$name, height = currentrow$height)
    return(l)
  }, .options = future_options(globals = TRUE)) # globals = TRUE should be the default
Waldi
  • the environment variable `api:::api.env` is not in the list. Furthermore, we changed the API to output the hidden env variable, it didn't change anything. – Courvoisier Sep 29 '20 at 09:53
  • How does the API store its connection? For a DB it would be, for example, `conn <- dbConnect(user,pwd,...)`. future_map works well with a single DB connection and multiple tasks. So the connection object is the key; making multiple connections isn't always mandatory. – Waldi Sep 29 '20 at 09:55
  • The API stores the connection in the package environment variable, which is a hidden variable. Also, all API calls do not have an explicit connection object as input. – Courvoisier Sep 29 '20 at 10:22

The way to solve this was to ask the developer of the API to add a variable to the API package that tests whether the connection is open. This way, I authenticate once on each of the future processes if the connection is not open; once that is done, all subsequent authentication calls in that process are skipped by the if clause.
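
A minimal sketch of that guard, in base R only. Everything here is a stand-in: `api_env`, its `connected` flag, `fake_api_configure()` and `ensure_authenticated()` are hypothetical names, since the real package and the variable its developer added are not shown. The counter exists only to show that the login happens once per process.

```r
# Hypothetical stand-in for the package's hidden environment
# (the real package keeps its token in api:::api.env).
api_env <- new.env()
api_env$connected <- FALSE
api_env$auth_calls <- 0L # counter used only for this demonstration

# Hypothetical stand-in for api.configure(): records that a login happened.
fake_api_configure <- function(username, password) {
  api_env$auth_calls <- api_env$auth_calls + 1L
  api_env$connected <- TRUE
  invisible(TRUE)
}

# Authenticate only if this process has not done so already.
ensure_authenticated <- function() {
  if (!isTRUE(api_env$connected)) {
    fake_api_configure(username = "username", password = "password")
  }
  invisible(api_env$connected)
}

# Simulate many tasks handled by the same process:
# only the first call reaches the login.
results <- vapply(1:10, function(i) {
  ensure_authenticated()
  i * 2
}, numeric(1))
```

With furrr, the same check would go on the first line of the function passed to `future_pmap_dfr()`. Assuming the flag lives in the package environment on each worker, each of the six processes would log in on its first row and skip the login afterwards.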

Courvoisier