0

I have this dataset which consists of location in the following format:

1. city, province/state, country
2. city
3. province/state
4. country

Now, much of the dataset is cleaned using the following method, and I only took the ones that doesn't have the clean set.

## first, I tried to clean by removing some texts that are not helpful

df <- df%>% 
  add_column(location_clean = trimws(str_replace_all(.$location, "Greater|Metropolitan|Metro|Area", "")), 
             .after = "location")


## separate the location into locale, area, and country. locale will be used to gather the country names in the next step
df <- df%>% 
  separate(location_clean, 
           c("locale", "area1","area2"), sep=",", 
           remove = FALSE, extra="merge", fill="right") %>%
  mutate(locale = trimws(locale), area1 = trimws(area1), area2 = trimws(area2))



    df <- structure(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area", 
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area", 
"Greater Tampa Bay Area", "Greater Sacramento", "United States", 
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area", 
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area", 
"Greater Hamburg Area", "United Arab Emirates", "Greater Stockholm Metropolitan Area", 
"Raleigh-Durham-Chapel Hill Area", "Slovakia", "Latvia", "Greater Sydney Area", 
"Sweden", "Romania", "Canada", "United States", "Metro Jacksonville", 
"Metro Jacksonville", "Greater Tampa Bay Area", "Singapore", 
"Greater Bolzano Metropolitan Area", "Greater Scranton Area"), 
    locale = c("United States", "Israel", "Stockholm", "Chicago", 
    "United States", "Minneapolis-St. Paul", "Tampa Bay", "Sacramento", 
    "United States", "Atlanta", "Kansas City", "Kansas City", 
    "Charlotte", "San Diego", "Hamburg", "United Arab Emirates", 
    "Stockholm", "Raleigh-Durham-Chapel Hill", "Slovakia", "Latvia", 
    "Sydney", "Sweden", "Romania", "Canada", "United States", 
    "Jacksonville", "Jacksonville", "Tampa Bay", "Singapore", 
    "Bolzano", "Scranton"), area1 = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), area2 = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), country = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), row.names = c(NA, -31L), class = c("tbl_df", "tbl", "data.frame"
))

The df now consists of the original location and locale separated from the location column. I'm lost at how to get the country name from the city and country name in the locale column (or location column).

I have tried using world.cities from the maps package, but something is not working correctly when I mutate the country column to get the country name from city names in the locale column.

For example: when I run the following code, I get inconsistent results.

get_country_name <- function(city){
  
  w <- world.cities %>% 
    filter(name == city) %>% 
    arrange(desc(pop)) %>% head(1) %>% pull(country.etc)
  
  return(w)
}

df %>% 
  mutate(country = ifelse(is.na(country), unlist(sapply(locale, get_country_name)), country))

Another issue is that there may be multiple countries with the same city name. One approach I'm taking is to select the country by the population, essentially, making an assumption that only people from big cities are in the database. If there's a better way to do this, that'd be great.

Output

The expected result is the dataframe with the correct country name filled.

df <- structure(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area", 
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area", 
"Greater Tampa Bay Area", "Greater Sacramento", "United States", 
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area", 
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area", 
"Greater Hamburg Area", "United Arab Emirates"), locale = c("United States", 
"Israel", "Stockholm", "Chicago", "United States", "Minneapolis-St. Paul", 
"Tampa Bay", "Sacramento", "United States", "Atlanta", "Kansas City", 
"Kansas City", "Charlotte", "San Diego", "Hamburg", "United Arab Emirates"
), area1 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), area2 = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    country = c("United States", "Israel", "Sweden", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "Germany", 
    "United Arab Emirates")), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))
user1828605
  • 1,723
  • 1
  • 24
  • 63
  • 2
    Can you please show an expected output data frame. Not sure whats expected when location and locale is e.g. "United States". – Andre Wildberg Mar 03 '23 at 14:32
  • @AndreWildberg, `location` is the data from the original dataset. `locale` is the one created after using `dplyr::separate` function. I acknowledge that the data is really messy, and that's what I needed to fix. Regardless of what data it has in locale, the country column needs to filled with the actual country name that is associated with the value in `locale`. E.g., if locale is "United States", the country should be "United States", for "Paris", country is "France" and so on. That's why I took one step at a time - first city, then country, etc. – user1828605 Mar 03 '23 at 16:09
  • Maybe this is what you need https://stackoverflow.com/questions/49435847/extracting-country-name-from-city-name-in-r . – Andre Wildberg Mar 03 '23 at 16:45
  • @AndreWildberg, I have already tried that. It works to some extent. The issue was that there was no way for me to mention the "city name" in the `countrycode` package, so I had to use "country.name" in origin param. When it encounters a city name - like "Hamburg", it inserts "Hamburg" as the country name when it should've been "Germany". I'd really appreciate it if you could point me to a place which shows how to use the package with city name. So far, I haven't found any. – user1828605 Mar 03 '23 at 16:52
  • 1
    The keyword is look-up table. Gotta run now but gonna stitch up a short example later. – Andre Wildberg Mar 03 '23 at 16:56

1 Answers1

1

With adist, using data from City_and_province_list.csv with a unique list of Countries (modified on command line).

Also further simplifying your cities by cutting off names after a dash, see sub("-.*", "", x)

Note that you're using "Tampa Bay" but it should be "Tampa". That's why it's associated with "Mbamba Bay, Tanzania".

Only showing columns locale and country

c_c <- read.csv("City_and_province_list.mod", header=T, sep="\t")

c_c <- rbind(c_c, c("United States", "United States of America"))

cbind(df[,1:2], 
  country = sapply(df$locale, function(x) 
    c_c$Country[which.min(adist(sub("-.*", "", x), c_c$City, 
      ignore.case = T))]))[,2:3]
                       locale                  country
1               United States United States of America
2                      Israel                   Israel
3                   Stockholm                   Sweden
4                     Chicago United States of America
5               United States United States of America
6        Minneapolis-St. Paul United States of America
7                   Tampa Bay                 Tanzania
8                  Sacramento United States of America
9               United States United States of America
10                    Atlanta United States of America
11                Kansas City United States of America
12                Kansas City United States of America
13                  Charlotte United States of America
14                  San Diego United States of America
15                    Hamburg                  Germany
16       United Arab Emirates     United Arab Emirates
17                  Stockholm                   Sweden
18 Raleigh-Durham-Chapel Hill United States of America
19                   Slovakia                 Slovakia
20                     Latvia                   Latvia
21                     Sydney                Australia
22                     Sweden                   Sweden
23                    Romania                  Romania
24                     Canada                   Canada
25              United States United States of America
26               Jacksonville United States of America
27               Jacksonville United States of America
28                  Tampa Bay                 Tanzania
29                  Singapore                Singapore
30                    Bolzano                    Italy
31                   Scranton United States of America

Same approach using dplyr

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(country = c_c$Country[which.min(
    adist(sub("-.*", "", locale), c_c$City, ignore.case = T))]) %>% 
  ungroup()
# A tibble: 31 × 5
   location                            locale               area1 area2 country 
   <chr>                               <chr>                <chr> <chr> <chr>   
 1 United States                       United States        NA    NA    United …
 2 Israel                              Israel               NA    NA    Israel  
 3 Greater Stockholm Metropolitan Area Stockholm            NA    NA    Sweden  
 4 Greater Chicago Area                Chicago              NA    NA    United …
 5 United States                       United States        NA    NA    United …
 6 Greater Minneapolis-St. Paul Area   Minneapolis-St. Paul NA    NA    United …
 7 Greater Tampa Bay Area              Tampa Bay            NA    NA    Tanzania
 8 Greater Sacramento                  Sacramento           NA    NA    United …
 9 United States                       United States        NA    NA    United …
10 Atlanta Metropolitan Area           Atlanta              NA    NA    United …
# … with 21 more rows
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • That's a great answer. Thanks. Is there a way I can use this in mutate? I just got a new set of data which already has some correct values. If I to run this with mutate and if_else, how would I do it? I tried something like this `df %>% mutate(country = countries$Country[which.min(adist(sub("-.*","",locale), countries$City, ignore.case = T))], area2)` but I get `NA` in all. – user1828605 Mar 06 '23 at 16:28