I have a rather badly formatted list of locations. I need to extract the names of the cities and countries for each entry. I am not sure how to proceed.

The list looks like:

c("Groningen", "Netherlands, Groningen", "Netherlands", "Jerusalem, Israel",
 "Nesher, Israel", "Western, United States", "U.S.", "United States",
 "Sacramento, California, USA")

Thanks, TK

Lstat
What have you tried so far? How can you extract the country name if it is not there? Do you have 100, 1000 or 10 million observations? – desval Apr 13 '20 at 12:46

1 Answer

Ideally, you would first check whether there is a package that lets you look up each location through a search service such as Google Maps.

If there isn't one, I would start by splitting each entry on the commas, matching the pieces against country names with the countrycode package, and moving on from there.

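library("countrycode")
library("data.table")

d <- data.table(raw = c("Groningen", "Netherlands, Groningen", "Netherlands", "Jerusalem, Israel",
  "Nesher, Israel", "Western, United States", "U.S.", "United States","Sacramento, California, USA"))

# Split each entry on commas into V1, V2, V3 and trim the space left after each comma
d <- cbind(
  d,
  d[, lapply(tstrsplit(raw, ",", fixed=TRUE), trimws) ]
)

# If the first piece is a country, the second piece (if any) is the city
d[, country := countrycode( V1, "country.name", "country.name")]
d[!is.na(country), city := V2]
# Otherwise the first piece is the city and the second piece may be the country
d[is.na(country), city := V1]
d[is.na(country), country := countrycode( V2, "country.name", "country.name")]
# V3 is never checked, so three-part entries keep country = NA
# countrycode() warns about values it cannot match; that is expected here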

                           raw            V1             V2   V3       country       city
1:                   Groningen     Groningen           <NA> <NA>          <NA>  Groningen
2:      Netherlands, Groningen   Netherlands      Groningen <NA>   Netherlands  Groningen
3:                 Netherlands   Netherlands           <NA> <NA>   Netherlands       <NA>
4:           Jerusalem, Israel     Jerusalem         Israel <NA>        Israel  Jerusalem
5:              Nesher, Israel        Nesher         Israel <NA>        Israel     Nesher
6:      Western, United States       Western  United States <NA> United States    Western
7:                        U.S.          U.S.           <NA> <NA> United States       <NA>
8:               United States United States           <NA> <NA> United States       <NA>
9: Sacramento, California, USA    Sacramento     California  USA          <NA> Sacramento
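
The last row shows the limit of checking only V1 and V2: the country sits in the third piece, so it stays NA. One way to move on from there, as a rough sketch rather than a finished solution, is to test every comma-separated piece and take the first one that countrycode recognises. The helper name parse_location is made up for this illustration, and taking the first non-country piece as the city is an assumption that will not hold for every entry (for example, "Western" is still reported as a city).

```r
library(countrycode)

raw <- c("Groningen", "Netherlands, Groningen", "Netherlands",
         "Jerusalem, Israel", "Nesher, Israel", "Western, United States",
         "U.S.", "United States", "Sacramento, California, USA")

# Hypothetical helper (not part of any package): classify every
# comma-separated piece of one entry instead of only the first two.
parse_location <- function(x) {
  # Split on commas and drop the whitespace around each piece
  pieces <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  # warn = FALSE silences the warnings for pieces that are not countries
  matched <- countrycode(pieces, "country.name", "country.name", warn = FALSE)
  is_country <- !is.na(matched)
  data.frame(
    raw     = x,
    # First piece recognised as a country, if any
    country = if (any(is_country)) matched[is_country][1] else NA_character_,
    # Assumption: the first piece that is not a country is the city
    city    = if (any(!is_country)) pieces[!is_country][1] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# One row per entry, in the original order
res <- do.call(rbind, lapply(raw, parse_location))
print(res)
```

Entries with no recognisable country, such as the bare "Groningen", still come back with country = NA, so those would need a lookup table or a geocoding service.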
desval