0

This question may look like a duplicate but I am facing some issue while extracting country names from the string. I have gone through this link [link]Extracting Country Name from Author Affiliations but I was not able to solve my problem.I have tried grepl and for loop for text matching and replacement, my data column consists of more than 300k rows so using grepl and for loop for pattern matching is very very slow.

I have a column like this.

org_loc

Zug
Zug  Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza 
York  United Kingdom
Delhi
Yalleroi  Queensland
Waterloo  Ontario
Waterloo  ON 
Washington  D.C.
Washington D.C. Metro 
New York


df$org_loc <- c("zug", "zug  canton of zug", "zimbabwe", 
"zigong", "zhuhai", "zaragoza","York  United Kingdom", "Delhi","Yalleroi  Queensland","Waterloo  Ontario","Waterloo  ON","Washington  D.C.","Washington D.C. Metro","New York")

the string may contain the name of a state, city or country. I just want Country as output. Like this

org_loc

Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United State
United state
United state

I am trying to convert state (if match found) to its country using countrycode library but not able to do so. Any help would be appreciable.

girijesh96
  • 455
  • 1
  • 4
  • 16

3 Answers3

2

You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary can not have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings in your example in your lookup CSV, so they are not all converted, but if you added all of the possible strings to the CSV, it would work completely.

library(countrycode)

org_loc <- c("Zug", "Zug  Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
             "Zaragoza", "York  United Kingdom", "Delhi",
             "Yalleroi  Queensland", "Waterloo  Ontario", "Waterloo  ON",
             "Washington  D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)

city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")

# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]

df$country <- countrycode(df$org_loc, "City", "Country", 
                          custom_dict = city_country)

df
# org_loc                  country
# 1                    Zug              Switzerland
# 2     Zug  Canton of Zug                     <NA>
# 3               Zimbabwe                     <NA>
# 4                 Zigong                    China
# 5                 Zhuhai                    China
# 6               Zaragoza                    Spain
# 7   York  United Kingdom                     <NA>
# 8                  Delhi                    India
# 9   Yalleroi  Queensland                     <NA>
# 10     Waterloo  Ontario                     <NA>
# 11          Waterloo  ON                     <NA>
# 12      Washington  D.C.                     <NA>
# 13 Washington D.C. Metro                     <NA>
# 14              New York United States of America
CJ Yetman
  • 8,373
  • 2
  • 24
  • 56
  • thank you for your response. I am able to achieve this answer using match function on the same csv but I would like to bring to your concern that if I have to text string like Washington D.C, Waterloo ON, Yalleroi Queensland. Metro than it is giving NA. How can I overcome this issue using countrycode library – girijesh96 Mar 27 '18 at 14:47
  • If you had the string "Washington D.C, Waterloo ON, Yalleroi Queensland", what would you expect `countrycode` to convert that to? – CJ Yetman Mar 27 '18 at 18:09
  • If you want "Yalleroi Queensland" to be converted to "Australia", you would add "Yalleroi Queensland" to your `City_and_province_list.csv` in the `org_loc` with "Australia" in the `country` column on the same line. – CJ Yetman Mar 27 '18 at 19:16
  • ok, Thank you for your answer But I think that will be a cumbersome process to add all such in City_and_province_list.csv because my original dataset has a large number of such variations in names – girijesh96 Mar 28 '18 at 12:07
  • So your question is not about **how** to do this, but **where** to find a dataset pre-made with all of your data in it? I can’t help with that. – CJ Yetman Mar 28 '18 at 12:56
0
library(countrycode)
df <- c("zug  switzerland", "zug  canton of zug  switzerland", "zimbabwe", 
            "zigong  chengdu  pr china", "zhuhai  guangdong  china", "zaragoza","York  United Kingdom", "Yamunanagar","Yalleroi  Queensland  Australia","Waterloo  Ontario","Waterloo  ON","Washington  D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')

It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.

Anonymous coward
  • 2,061
  • 1
  • 16
  • 29
  • I have also used the same code for the country but I want country name corresponding to the city. for eg for London I want united kingdom – girijesh96 Mar 22 '18 at 18:47
  • The `map` library's `world.cities` might be of help. See [this SO question](https://stackoverflow.com/questions/46135055/how-do-i-convert-city-names-to-time-zones). – Anonymous coward Mar 22 '18 at 19:10
0

With function geocode from package ggmap you may accomplish, with good but not total accuracy your task; you must also use your criterion to say "Zaragoza" is a city in Spain (which is what geocode returns) and not somewhere in Argentina; geocode tends to give you the biggest city when there are several homonyms. (remove the $country to see all of the output)

library(ggmap)
org_loc <- c("zug", "zug  canton of zug", "zimbabwe", 
                "zigong", "zhuhai", "zaragoza","York  United Kingdom", 
             "Delhi","Yalleroi  Queensland","Waterloo  Ontario","Waterloo  ON","Washington  D.C.","Washington D.C. Metro","New York")
    geocode(org_loc, output = "more")$country

as geocode is provided by google, it has a query limit, 2,500 per day per IP address; if it returns NAs it may be because an unconsistent limit check, just try it again.

Elio Diaz
  • 566
  • 2
  • 19
  • I have edited the question a bit. I have accomplished country name but I want a country name for the city. Can you please look for it – girijesh96 Mar 22 '18 at 18:52
  • thankyou for your help it is working suboptimally but I have almost 300k values like this. I would like to ask one thing can we do it with key-value pair if we have csv of all city and their country. Here is [link] https://github.com/girijesh18/dataset of csv file and sample file. I have tried to map using both files with for loop and grepl function but it is causing overwriting of values.Can you help me in building a solution for the city to country mapping using both csv files? – girijesh96 Mar 22 '18 at 19:48
  • you may just concatenate both your data and your target csv with paste(city, country), that would yield a "zaragoza spain" in one single column, then dplyr::left_join(); that's very fast with matching records that fully match. Your csv doesn't have all the cities in the world – Elio Diaz Mar 22 '18 at 19:57
  • city_and_province_list contains all city name present in this [link]github.com/girijesh18/dataset. but how to apply it using left join. can you give a brief about that – girijesh96 Mar 23 '18 at 02:12