I have this dataset which consists of location in the following format:
1. city, province/state, country
2. city
3. province/state
4. country
Now, much of the dataset is cleaned using the following method, and I only took the ones that doesn't have the clean set.
## first, I tried to clean by removing some texts that are not helpful
df <- df%>%
add_column(location_clean = trimws(str_replace_all(.$location, "Greater|Metropolitan|Metro|Area", "")),
.after = "location")
## separate the location into locale, area, and country. locale will be used to gather the country names in the next step
df <- df%>%
separate(location_clean,
c("locale", "area1","area2"), sep=",",
remove = FALSE, extra="merge", fill="right") %>%
mutate(locale = trimws(locale), area1 = trimws(area1), area2 = trimws(area2))
df <- structure(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area",
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area",
"Greater Tampa Bay Area", "Greater Sacramento", "United States",
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area",
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area",
"Greater Hamburg Area", "United Arab Emirates", "Greater Stockholm Metropolitan Area",
"Raleigh-Durham-Chapel Hill Area", "Slovakia", "Latvia", "Greater Sydney Area",
"Sweden", "Romania", "Canada", "United States", "Metro Jacksonville",
"Metro Jacksonville", "Greater Tampa Bay Area", "Singapore",
"Greater Bolzano Metropolitan Area", "Greater Scranton Area"),
locale = c("United States", "Israel", "Stockholm", "Chicago",
"United States", "Minneapolis-St. Paul", "Tampa Bay", "Sacramento",
"United States", "Atlanta", "Kansas City", "Kansas City",
"Charlotte", "San Diego", "Hamburg", "United Arab Emirates",
"Stockholm", "Raleigh-Durham-Chapel Hill", "Slovakia", "Latvia",
"Sydney", "Sweden", "Romania", "Canada", "United States",
"Jacksonville", "Jacksonville", "Tampa Bay", "Singapore",
"Bolzano", "Scranton"), area1 = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), area2 = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), country = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c(NA, -31L), class = c("tbl_df", "tbl", "data.frame"
))
The df now consists of the original location and locale separated from the location column. I'm lost at how to get the country name from the city and country name in the locale
column (or location
column).
I have tried using world.cities
from the maps
package, but something is not working correctly when I mutate
the country
column to get the country name from city names in the locale
column.
For example: when I run the following code, I get inconsistent results.
get_country_name <- function(city){
w <- world.cities %>%
filter(name == city) %>%
arrange(desc(pop)) %>% head(1) %>% pull(country.etc)
return(w)
}
df %>%
mutate(country = ifelse(is.na(country), unlist(sapply(locale, get_country_name)), country))
Another issue is that there may be multiple countries with the same city name. One approach I'm taking is to select the country by the population, essentially, making an assumption that only people from big cities are in the database. If there's a better way to do this, that'd be great.
Output
The expected result is the dataframe with the correct country name filled.
df <- structure(list(location = c("United States", "Israel", "Greater Stockholm Metropolitan Area",
"Greater Chicago Area", "United States", "Greater Minneapolis-St. Paul Area",
"Greater Tampa Bay Area", "Greater Sacramento", "United States",
"Atlanta Metropolitan Area", "Kansas City Metropolitan Area",
"Kansas City Metropolitan Area", "Charlotte Metro", "San Diego Metropolitan Area",
"Greater Hamburg Area", "United Arab Emirates"), locale = c("United States",
"Israel", "Stockholm", "Chicago", "United States", "Minneapolis-St. Paul",
"Tampa Bay", "Sacramento", "United States", "Atlanta", "Kansas City",
"Kansas City", "Charlotte", "San Diego", "Hamburg", "United Arab Emirates"
), area1 = c(NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), area2 = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
country = c("United States", "Israel", "Sweden",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "Germany",
"United Arab Emirates")), row.names = c(NA, -16L), class = c("tbl_df",
"tbl", "data.frame"))