I am looking to quickly geocode a large number of addresses using Google's API, accessed through ggmap (version 2.7, which allows you to specify a Google API key via the register_google() function). The basic functionality works, but it is very slow when used with dplyr in the following manner:
devtools::install_github("dkahle/ggmap") #needed for v 2.7, not yet on CRAN
library(dplyr)
library(ggmap)
DF %>%
select(ID, full_address) %>%
mutate_geocode(full_address)
I am getting at best 1-3 queries per second, which will not be fast enough for the few thousand records I need to process. I would like to use multidplyr to run queries in parallel and speed up the process.
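(For reference, here is roughly how I am estimating that rate: timing a small batch of sequential calls with base R's system.time(); the 20-address sample size is arbitrary.)

# time a small batch of geocode() calls to estimate queries per second
system.time(ggmap::geocode(DF$full_address[1:20]))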
I have seen a multidplyr example on SO here, but I cannot seem to adapt it to my case.
The solution I imagine would use multidplyr::partition() and then either:
- the ggmap::mutate_geocode() function; or
- the ggmap::geocode() function, which returns a pair of columns (longitude and latitude), combined with the dplyr::do() function (see the small example after this list).
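For reference, geocode() takes a character vector of addresses and returns a data frame with lon and lat columns, one row per address (the address below is just an illustration):

library(ggmap)
geocode("1600 Pennsylvania Ave NW, Washington, DC") # one-row data frame with columns lon and lat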
Based on the SO example linked above, I believe the code could look something like this:
library(dplyr)
library(ggmap)
devtools::install_github("hadley/multidplyr")
library(multidplyr)
DF$split <- rep_len(1:4, nrow(DF)) # the column by which data will be split across 4 workers; rep_len avoids recycling errors when nrow(DF) is not a multiple of 4
# For solution (1)
DF %>%
  select(ID, full_address, split) %>% # keep the split column so partition() can find it
  partition(split) %>%
  mutate_geocode(full_address) %>%
  collect() # bring the pieces back into one data frame
# For solution (2)
DF %>%
  select(ID, full_address, split) %>% # keep the split column here too
  partition(split) %>%
  do(bind_cols(., ggmap::geocode(.$full_address))) %>% # geocode(), not geo_code(); `.` is each worker's chunk
  collect()
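Either way, I suspect each worker also needs ggmap (and dplyr) attached and the API key registered in its own session. A rough sketch of the setup I imagine, assuming I am reading the multidplyr dev API correctly (create_cluster(), cluster_library(), cluster_eval()); MY_GOOGLE_API_KEY is again a placeholder:

library(multidplyr)

cluster <- create_cluster(4) # one R session per worker
cluster_library(cluster, "dplyr") # attach packages in each worker session
cluster_library(cluster, "ggmap")
cluster_eval(cluster, register_google(key = "MY_GOOGLE_API_KEY")) # register the key on every worker

geocoded <- DF %>%
  select(ID, full_address, split) %>%
  partition(split, cluster = cluster) %>%
  do(mutate_geocode(., full_address)) %>% # run ggmap's geocoding on each worker's chunk
  collect()

With four workers I would hope for roughly four times the current throughput, subject to Google's quota.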
Can anyone help me get this into the correct shape? Any recommendations for making it even faster?