I am looking to quickly geocode a large number of addresses using Google's API, accessed through ggmap (I am using ggmap version 2.7, which allows specifying a Google API key via the register_google() function).
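
For reference, registering the key is a one-time call per R session; a minimal sketch (YOUR_API_KEY is a placeholder):

library(ggmap)  # development version 2.7
register_google(key = "YOUR_API_KEY")  # placeholder; substitute your own key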

The basic functionality works but is very slow when using dplyr in the following manner:

devtools::install_github("dkahle/ggmap")  # needed for v2.7, not yet on CRAN

library(dplyr)
library(ggmap)

DF %>%
  select(ID, full_address) %>%
  mutate_geocode(full_address)  # appends lon and lat columns

I am getting 1-3 queries per second at best, which will not be fast enough for the few thousand records I have. I would like to use multidplyr to run the queries in parallel and speed up the process.

I have seen a multidplyr example on SO here but I cannot seem to implement it for my case.

The solution I imagine would use multidplyr::partition() together with either:

  1. the ggmap::mutate_geocode() function; or
  2. the ggmap::geocode() function, which returns a pair of columns (longitude and latitude), combined with dplyr::do().

Based on the SO example linked above, I believe the code could look something like this:

  devtools::install_github("hadley/multidplyr")  # not yet on CRAN

  library(dplyr)
  library(ggmap)
  library(multidplyr)

  DF$split <- rep_len(1:4, nrow(DF))  # the column by which data will be split across 4 workers

  # For solution (1)
  DF %>%
    select(ID, full_address, split) %>%  # keep split so partition() can see it
    partition(split) %>%
    mutate_geocode(full_address)

  # For solution (2)
  DF %>%
    select(ID, full_address, split) %>%
    partition(split) %>%
    do(ggmap::geocode(full_address))

Can anyone help me get this into the correct shape? Any recommendations to make this even faster?

  • (1) You need to either load ggmap on the cluster, or use `ggmap::mutate_geocode`, but either way it would depend on there being a method for `party_df` (which probably isn't there). (2) Within `do` you need to refer to variables as `.$variable`, not just `variable`. – Axeman Dec 13 '16 at 11:39
  • @Axeman, thanks for the quick feedback. I couldn't find a way to make solution (1) work, and I don't believe it can handle `party_df`, as you had already identified. Solution (2), however, does work with your suggestion (see the sketch after these comments). I have also added a `%>% collect()` at the end to repatriate the answers from the cores. I am seeing modest speed improvements, but the Google API error rate seems to increase just as rapidly when I ramp up the number of parallel queries (i.e. I get NA) - not sure if this is because of the network or some other API issue – Olivier Dec 13 '16 at 16:06
  • There's probably a cap on how many API calls are allowed. – Axeman Dec 13 '16 at 18:24
  • @Axeman - yes, 2,500 API calls in a day, unless you pay for extra – SymbolixAU Dec 13 '16 at 20:21
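
Update (based on the comments): an untested sketch of how solution (2) might look with the fixes folded in. Each worker is a fresh R session, so ggmap must be loaded (and, presumably, the API key registered) on every node. The cluster helpers below (create_cluster, set_default_cluster, cluster_library, cluster_eval) come from the development version of multidplyr and may differ between versions; YOUR_API_KEY is a placeholder, and cbind() is used here simply to keep ID next to the returned coordinates.

  library(dplyr)
  library(ggmap)
  library(multidplyr)

  cluster <- create_cluster(4)       # 4 worker R sessions
  set_default_cluster(cluster)
  cluster_library(cluster, "ggmap")  # load ggmap on every worker
  cluster_eval(cluster, register_google(key = "YOUR_API_KEY"))  # placeholder key

  DF$split <- rep_len(1:4, nrow(DF))  # spread rows across the 4 workers

  result <- DF %>%
    select(ID, full_address, split) %>%
    partition(split) %>%
    do(cbind(., ggmap::geocode(.$full_address))) %>%  # .$ per Axeman's comment
    collect()  # repatriate results from the workers

Keep the daily quota mentioned above in mind: parallelism raises throughput but not the 2,500-requests-per-day cap, so calls past the limit will still come back as NA.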
