
I am trying to geocode some addresses stored in character vectors. I've used the geocode() function in the ggmap package; however, it only classified about 50% of my addresses. I was hoping to use a more rudimentary approach: looking up whether a city name (from the world.cities data in the maps package) appears in my list of addresses and, if so, taking the longitude and latitude from this lookup table. I'll then clean up the returned file and complement it with the other geocoding approaches (calls to various external APIs) that R provides. What I've coded thus far is below:

places <- c("Atlanta,Georgia", "My house, Paris, France", "Some Other House, Paris, Ontario, Canada", "Paris", "Oxford", "Oxford, USA")

library(maps)
data(world.cities)
ddd <- world.cities[world.cities$name %in% c("Paris","Oxford","New York"),]

is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}

for (i in seq_along(places)) {
  for (j in seq_len(nrow(ddd))) {
    k <- ddd$name[j]
    if (is.integer0(grep(k, places[i], perl = TRUE))) next
    if (!exists("zzz")) {
      zzz <- cbind(places[i], ddd[j, 1:5])
    } else {
      zzz <- rbind(zzz, cbind(places[i], ddd[j, 1:5]))
    }
  }
}

The output is what I want (I will subjectively clean it later). My problem is that my real data is about 8,000 addresses and the world.cities data has 40,000+ cities, so the double for-loop approach is a bit slow. As with other tasks in R, I suppose this could be vectorized with some member of the apply family, but I'm having trouble wrapping my head around how to do it. Any thoughts?
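For what it's worth, here is a sketch of how I imagine the double loop could be vectorized: one `grepl()` call per city name via `sapply()` builds a places-by-cities logical matrix, and `which(..., arr.ind = TRUE)` pulls out the matching pairs (I've substituted `fixed = TRUE` literal matching for the `perl = TRUE` above, since the city names are plain strings):

```r
library(maps)
data(world.cities)

places <- c("Atlanta,Georgia", "My house, Paris, France",
            "Some Other House, Paris, Ontario, Canada",
            "Paris", "Oxford", "Oxford, USA")
ddd <- world.cities[world.cities$name %in% c("Paris", "Oxford", "New York"), ]

# one grepl() per city instead of one per (place, city) pair:
# `hits` is a length(places) x nrow(ddd) logical match matrix
hits <- sapply(ddd$name, grepl, x = places, fixed = TRUE)
idx  <- which(hits, arr.ind = TRUE)   # row = place index, col = city index
zzz  <- cbind(places = places[idx[, 1]], ddd[idx[, 2], 1:5])
```

As with the loop, places that match no city (like "Atlanta,Georgia") simply don't appear in `zzz`.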

### Output
                                      places[i]   name country.etc     pop    lat  long
28245                   My house, Paris, France  Paris      Canada   10570  43.20 -80.38
28246                   My house, Paris, France  Paris      France 2141839  48.86   2.34
282451 Some Other House, Paris, Ontario, Canada  Paris      Canada   10570  43.20 -80.38
282461 Some Other House, Paris, Ontario, Canada  Paris      France 2141839  48.86   2.34
282452                                    Paris  Paris      Canada   10570  43.20 -80.38
282462                                    Paris  Paris      France 2141839  48.86   2.34
27671                                    Oxford Oxford      Canada    1271  45.73 -63.87
27672                                    Oxford Oxford New Zealand    1816 -43.30 172.18
27673                                    Oxford Oxford          UK  157568  51.76  -1.26
276711                              Oxford, USA Oxford      Canada    1271  45.73 -63.87
276721                              Oxford, USA Oxford New Zealand    1816 -43.30 172.18
276731                              Oxford, USA Oxford          UK  157568  51.76  -1.26

After some further data cleaning I would really want:

### Output
                                      places[i]   name country.etc     pop    lat  long
 28246                   My house, Paris, France  Paris      France 2141839  48.86   2.34
282451 Some Other House, Paris, Ontario, Canada  Paris      Canada   10570  43.20 -80.38
282462                                    Paris  Paris      France 2141839  48.86   2.34
27673                                    Oxford Oxford          UK  157568  51.76  -1.26
276731                              Oxford, USA Oxford          NA       NA    NA  NA
                               Atlanta, Georgia     NA          NA       NA    NA  NA

Basically, the logic is:

  1. If a country also matches the places string, keep that row (the "My house, Paris, France" and "Some Other House, Paris, Ontario, Canada" examples).
  2. If the places string is a single word, guess that it refers to the city with the largest population, so default "Paris" to Paris, France and "Oxford" to Oxford, UK, since a non-unique address is hard to geocode.
  3. If the places string contains more than a single word but the country does not match any of those other words (like "Oxford, USA"), make everything except the city NA. Here I will try my luck with geocode() and other services to get better information.
  4. If the places address was never in the lookup dictionary, add it and then try to fill in everything (really I only want long/lat) using geocode() etc. That's the "Atlanta,Georgia" example.
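Rules 1-3 could be sketched per place with `split()` over a matched table like the one above (hypothetical and untested against my real data; I'm assuming the match column is named `places` rather than the `places[i]` the loop produces, and rule 4 is left to geocode()):

```r
library(maps)
data(world.cities)

# toy matched table: one row per (place, candidate city) pair
ddd <- world.cities[world.cities$name %in% c("Paris", "Oxford"), ]
zzz <- rbind(
  cbind(places = "My house, Paris, France", ddd[ddd$name == "Paris", 1:5]),
  cbind(places = "Paris",                   ddd[ddd$name == "Paris", 1:5]),
  cbind(places = "Oxford, USA",             ddd[ddd$name == "Oxford", 1:5])
)

clean_matches <- function(zzz) {
  do.call(rbind, lapply(split(zzz, zzz$places), function(d) {
    p <- d$places[1]
    hit <- sapply(d$country.etc, grepl, x = p, fixed = TRUE)
    if (any(hit)) return(d[hit, , drop = FALSE])          # rule 1: country named in address
    if (!grepl("[ ,]", p)) return(d[which.max(d$pop), ])  # rule 2: bare city -> largest pop
    d <- d[1, , drop = FALSE]                             # rule 3: keep city, blank the rest
    d[, c("country.etc", "pop", "lat", "long")] <- NA
    d
  }))
}

clean_matches(zzz)
```

Places with no match at all (rule 4) never reach `clean_matches()`, so they'd be collected separately and sent to geocode().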

Thoughts on the approach in general, and on how to do this better in R? As mentioned above, the impetus for this approach was to see whether I could complement what I already get (about 50% of addresses geocoded with the geocode() function).

Chris

### 1 Answer


This makes the city extraction more generic (using regex matching via stringr's str_match_all()) and then does a merge with the world.cities data:

library(stringr)

places_dat <- cbind(places, Reduce(rbind,
  lapply(str_match_all(places, ",* *([[:alpha:]]+) *, *([[:alpha:]]+) *$"),
         function(x) {
           if (length(x) == 0) {
             data.frame(city = NA, state = NA)
           } else {
             data.frame(city = x[, 2], state = x[, 3])
           }
         })))

places_dat

##                                     places    city   state
## 1                          Atlanta,Georgia Atlanta Georgia
## 2                  My house, Paris, France   Paris  France
## 3 Some Other House, Paris, Ontario, Canada Ontario  Canada
## 4                                    Paris    <NA>    <NA>
## 5                                   Oxford    <NA>    <NA>
## 6                              Oxford, USA  Oxford     USA

merge(places_dat, world.cities, by.x="city", by.y="name", all.x=TRUE)

##      city                                   places   state country.etc     pop    lat    long capital
## 1 Atlanta                          Atlanta,Georgia Georgia         USA  424096  33.76  -84.42       0
## 2   Paris                  My house, Paris, France  France      France 2141839  48.86    2.34       1
## 3   Paris                  My house, Paris, France  France      Canada   10570  43.20  -80.38       0
## 4 Ontario Some Other House, Paris, Ontario, Canada  Canada         USA  175805  34.05 -117.61       0
## 5  Oxford                              Oxford, USA     USA      Canada    1271  45.73  -63.87       0
## 6  Oxford                              Oxford, USA     USA New Zealand    1816 -43.30  172.18       0
## 7  Oxford                              Oxford, USA     USA          UK  157568  51.76   -1.26       0
## 8    <NA>                                    Paris    <NA>        <NA>      NA     NA      NA      NA
## 9    <NA>                                   Oxford    <NA>        <NA>      NA     NA      NA      NA

It still requires some sifting (perhaps `complete.cases()` as one step), but it gets you further and should be a good bit faster.
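As a rough sketch of that sifting (the agreement test and the stand-in `places_dat` below are my assumptions, not a full solution), you could split settled rows from those still needing an external geocoder:

```r
library(maps)
data(world.cities)

# minimal stand-in for the `places_dat` built above
places_dat <- data.frame(
  places = c("My house, Paris, France", "Oxford, USA", "Atlanta,Georgia"),
  city   = c("Paris", "Oxford", "Atlanta"),
  state  = c("France", "USA", "Georgia")
)

res  <- merge(places_dat, world.cities, by.x = "city", by.y = "name", all.x = TRUE)
done <- subset(res, !is.na(lat) & state == country.etc)   # extracted word agrees with country
todo <- unique(res$places[!res$places %in% done$places])  # hand these back to geocode()
```

Note that "Atlanta,Georgia" still lands in `todo`, since world.cities stores the country ("USA") rather than the US state, which is exactly the geocode() fallback case.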

hrbrmstr
  • The `lapply()` example and the regular expression pre-cleaning are helpful hints!! Thanks!! – Chris Sep 12 '14 at 21:10