I am trying to geocode some addresses stored in character vectors. I've used the geocode() function in the ggmap package; however, it only successfully geocoded about 50% of my addresses. I was hoping to use a more rudimentary approach: looking up whether a city name (from the world.cities data in the maps package) appears in each address string, and if so, taking the longitude and latitude from this lookup table. I'll then clean up the returned file and complement it with the other geocoding approaches (calls to various external APIs) that R provides. What I've coded thus far is below:
    places <- c("Atlanta,Georgia", "My house, Paris, France",
                "Some Other House, Paris, Ontario, Canada",
                "Paris", "Oxford", "Oxford, USA")

    library(maps)
    data(world.cities)
    # small demo lookup table; the real one is the full 40,000+ rows of world.cities
    ddd <- world.cities[world.cities$name %in% c("Paris", "Oxford", "New York"), ]

    # grep() returns integer(0) when there is no match
    is.integer0 <- function(x) {
      is.integer(x) && length(x) == 0L
    }
    for (i in 1:length(places)) {
      for (j in 1:nrow(ddd)) {
        k <- ddd$name[j]
        # skip this city unless its name occurs in the address
        # (fixed = TRUE matches literally; some city names contain regex metacharacters)
        if (is.integer0(grep(k, places[i], fixed = TRUE))) next
        # first hit creates zzz; later hits are appended row by row
        if (!exists("zzz")) {
          zzz <- cbind(places[i], ddd[j, 1:5])
        } else {
          zzz <- rbind(zzz, cbind(places[i], ddd[j, 1:5]))
        }
      }
    }
The output is what I want (I will subjectively clean it up later). My problem is that my real data is about 8,000 addresses and the world.cities data has 40,000+ cities, so the double for loop approach is slow. As with other tasks in R, I suppose this could be vectorized with some member of the apply family, but I'm having trouble wrapping my head around how to do it. Any thoughts?
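For reference, this is the direction I imagine a vectorized version taking. It's an untested sketch (hits and idx are just names I made up), and the rows come out in a different order than the loop produces, but the content should be the same:

    # build the whole address-by-city match matrix in one shot:
    # hits[i, j] is TRUE when city name j occurs in address i
    hits <- sapply(ddd$name, function(city) grepl(city, places, fixed = TRUE))
    # take every matching (address, city) pair and index into both tables at once
    idx <- which(hits, arr.ind = TRUE)
    zzz <- cbind(places = places[idx[, "row"]], ddd[idx[, "col"], 1:5])

Even if the matching itself can't be sped up, building zzz once at the end should beat growing it with rbind() inside the loop. The output below is from the double-loop version.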
### Output
                                          places[i]   name country.etc     pop    lat   long
    28245                       My house, Paris, France  Paris      Canada   10570  43.20 -80.38
    28246                       My house, Paris, France  Paris      France 2141839  48.86   2.34
    282451 Some Other House, Paris, Ontario, Canada  Paris      Canada   10570  43.20 -80.38
    282461 Some Other House, Paris, Ontario, Canada  Paris      France 2141839  48.86   2.34
    282452                                     Paris  Paris      Canada   10570  43.20 -80.38
    282462                                     Paris  Paris      France 2141839  48.86   2.34
    27671                                     Oxford Oxford      Canada    1271  45.73 -63.87
    27672                                     Oxford Oxford New Zealand    1816 -43.30 172.18
    27673                                     Oxford Oxford          UK  157568  51.76  -1.26
    276711                               Oxford, USA Oxford      Canada    1271  45.73 -63.87
    276721                               Oxford, USA Oxford New Zealand    1816 -43.30 172.18
    276731                               Oxford, USA Oxford          UK  157568  51.76  -1.26
After some further data cleaning I would really want:
### Desired output
                                          places[i]   name country.etc     pop    lat   long
    28246                       My house, Paris, France  Paris      France 2141839  48.86   2.34
    282451 Some Other House, Paris, Ontario, Canada  Paris      Canada   10570  43.20 -80.38
    282462                                     Paris  Paris      France 2141839  48.86   2.34
    27673                                     Oxford Oxford          UK  157568  51.76  -1.26
    276731                               Oxford, USA Oxford          NA      NA     NA     NA
                                Atlanta, Georgia     NA          NA      NA     NA     NA
Basically, the logic is (a sketch of how I picture implementing these rules follows the list):

- If a country also matches the places string, keep that row. That handles the "My house, Paris, France" and "Some Other House, Paris, Ontario, Canada" examples.
- If the places string is a single word, guess that it refers to the city with the largest population, since a non-unique address is hard to geocode. So "Paris" defaults to Paris, France and "Oxford" to Oxford, UK.
- If the places string contains more than one word but the country doesn't match any of the other words, like "Oxford, USA", set everything except the city to NA. Here I will try my luck with geocode() and other services to get better information.
- If the places address was never in the lookup dictionary, add it anyway, and then try to fill in everything (really I only want long/lat) using geocode() etc. That's the "Atlanta, Georgia" example.
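Here is that sketch. It's untested base R; clean_one is a name I made up, I treat "single word" as "contains no comma", and I rename the first column of zzz (which cbind() called "places[i]") for convenience:

    names(zzz)[1] <- "place"

    clean_one <- function(d) {
      # rule 1: keep rows whose country also appears in the address string
      hit <- mapply(grepl, d$country.etc, d$place, MoreArgs = list(fixed = TRUE))
      if (any(hit)) return(d[hit, ])
      # rule 2: a bare city name (no comma) defaults to the largest population
      if (!grepl(",", d$place[1], fixed = TRUE)) return(d[which.max(d$pop), ])
      # rule 3: multi-word address with no country match: keep the city, NA the rest
      d <- d[1, ]
      d[c("country.etc", "pop", "lat", "long")] <- NA
      d
    }

    cleaned <- do.call(rbind, lapply(split(zzz, zzz$place), clean_one))

Addresses that never matched the dictionary (the "Atlanta, Georgia" case, rule 4) aren't in zzz at all, so I'd append them afterwards with setdiff(places, cleaned$place) and leave their columns NA for the external services to fill.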
Thoughts on the approach in general, and on how to do this better in R? As mentioned above, the impetus for this approach was to see if I could complement what I already get from the geocode() function (about 50% of addresses geocoded).
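For completeness, the fallback step I have in mind looks roughly like this. It's a sketch that assumes the cleaned frame from above; note that ggmap's geocode() needs a Google API key registered via register_google() for the Google source:

    library(ggmap)

    # fill in only the rows the lookup table could not resolve
    todo <- is.na(cleaned$lat)
    coords <- geocode(cleaned$place[todo], output = "latlon")
    cleaned$long[todo] <- coords$lon
    cleaned$lat[todo]  <- coords$lat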