-1

I am new in R and coding world, pardon if i perhaps mispelled some or more jargon here (cmiiw).

I am facing a challenge to clean city name in a dataframe.

enter image description here

Tried to use GetCloseMatches, strdist_inner_join (with fuzzywuzzy i believe) with dplyr style but still haven't meet my needs.

1st attempt:

vec3 = unlist(world.cities$name)

str1 = c('Jakarta Utara')

GetCloseMatches(string = str1, sequence_strings = vec3, n = 1L, cutoff = 0.6)

but it can only "translate" one of city each time, do you know how to make it repeat for all of the dataframe? for loop or function?

2nd attempt:

df2 <- df[1:10,] %>%

stringdist_left_join(world.cities, by = c(cust_city = "name"), max_dist = 1)

it shows most of the city but missing the "Jakarta Utara"

I am using two database/dataframe(cmiiw) of the city to be checked with (If you see the "Look Up" table on the right side, it has hundreds of city name, not only 6), first is SHP files that i fortified, second is world.cities$name, both are doing great but somehow it only appear one city at a time. ie: if i am using SHP files, Jakarta Utara is appear but Karawang is not, vice versa.

My Goal is to replace the left word to the right word (1 to 2)

enter image description here

left > right

Karawang - to Karawang

Jakarta Utara to Jakarta

Jakarta to Jakarta, etc

Do you know the most efficient way to do it?

Thank you very much for your helps!

regards

rgoei
  • 1
  • 3
  • 1
    Hum, maybe you could vectorize the function using `Vectorize` (I made a post about it [here](https://biolitika.si/vectorizing-functions-in-r-is-easy.html))? – Roman Luštrik Apr 15 '21 at 05:42
  • There are a few solutions here. If your question was answered by one of us. Please close your question by clicking the checkmark to the left of the answer that helped you most. Thank you. – Shawn Janzen Apr 15 '21 at 09:33

4 Answers4

1

You can use map and str_detect. Please let me know if it works.

library(tidyverse)

df %>%
  mutate(City = map(City, ~df1$City[str_detect(.x,df1$City)])) %>% 
  unnest ()

Output:

# A tibble: 5 x 3
  Name      Qty City             
  <chr>   <dbl> <chr>            
1 Alex       10 Jakarta          
2 Bambang     5 Jakarta          
3 Charlie    15 Batam            
4 Delta      10 Tangerang Selatan
5 Emily       5 Jakarta   

Data:

df <- tribble(
~Name, ~Qty, ~City,
"Alex", 10, "Jakarta Barat",
"Bambang", 5, "Jakarta",
"Charlie", 15, "Nagoya Batam",
"Delta", 10, "Bintaro Tangerang Selatan",
"Emily", 5, "Tendean Jakarta Selatan 11750"
)


df1 <- tribble(
~City,
"Jakarta",
"Bandung",
"Batam",
"Surabay",
"Tangerang Selatan"
)

TarJae
  • 72,363
  • 6
  • 19
  • 66
0

I've updated the answer to pick up more countries using a match to maps::world.cities.

library(tidyverse)
library(maps)
library(fuzzyjoin)

wc <- world.cities %>% 
  as_tibble()

table <- data.frame(
  customers = seq(1, 5, 1),
  city = c(
    "Jakarta Barat",
    "Jakarta",
    "Nagoya Batam",
    "Bintaro Tangerang Selatan",
    "Tendean Jakarta Selatan 11750"
  )) %>% 
  as_tibble() %>% 
  mutate(country = "Indonesia")

table %>% 
  regex_inner_join(wc, 
                   by = c(city = "name",
                          country = "country.etc"))

I've added the country column to my data to make the joins more accurate. This could be extended to hundreds of cities.

Desmond
  • 1,047
  • 7
  • 14
  • Thank you for your response! What if the List table are as much as hundreds of city? (Not only 6 city in Look Up table) – rgoei Apr 15 '21 at 06:19
  • I've updated the answer to make it applicable to hundreds of cities/countries. – Desmond Apr 15 '21 at 07:03
0

You can use the case_when in dplyr to mapp your cities according your requirements

library(dplyr) # for mutate and case_when

# demo data
data_input <- data.frame(num = c(1,2,3,4,5), 
                         city = c("Jakarta Barat", "Jakarta", "Nagoya Batam", 
                                  "Bintaro Tangerang Seltan", "Tandean Jakarta Selatan"), 
                         stringsAsFactors = FALSE)

# Use case_when to mapp according to mapping table
output_reqd <- data_input %>% 
  mutate(new_city = case_when(grepl(pattern = "Jakarta", x = city) ~ "Jakarta",
                              grepl(pattern = "Batam", x = city) ~ "Batam",
                              grepl(pattern = "Tangerang Seltan", x = city) ~ "Tangerang Seltan",
                              TRUE ~ city)
         )
discipulus
  • 2,665
  • 3
  • 34
  • 51
  • 1
    Thank you for your response! What if the List table are as much as hundreds of city? (Not only 6 city in Look Up table) – rgoei Apr 15 '21 at 06:19
  • @rgoei see my solution with a single vector of cities and use of a for loop. A similar solution could be adapted with the mutate & case_when as well. – Shawn Janzen Apr 15 '21 at 06:36
0

if I understand your question right, you want to parse the City variable against a list of known City names and replace the longer City name with the version on the known City name list. Right? If yes, then hopefully this approach will work for you (no additional packages needed):

# replicate your example data 1
d <-data.frame("No"=c(1,2,3,4,5),"Name"=c("Alex","Bambang","Charlie","Delta","Emily"),"Qty"=c(10,5,15,10,5),"City"=c("Jakarta Barat","Jakarta","Nagoya Batam","Bintaro Tangerang Selatan","Tendean Jakarta Selatan 11750"))
# replicate your vector of known city names
city_list <- c("Jakarta","Bandung","Batam","Surabaya","Tangerang Selatan")

# making a new placeholder variable to store the matched city names.
d$City_fix <- NA

# use a for loop, ifelse(), and grepl() to go through the vector of known cities, and replace the city name when a match is found.
for (i in 1:length(city_list)){
  d$City_fix <- ifelse(grepl(city_list[i], d$City), city_list[i], d$City_fix)
}

# view results
d

  No    Name Qty                          City          City_fix
1  1    Alex  10                 Jakarta Barat           Jakarta
2  2 Bambang   5                       Jakarta           Jakarta
3  3 Charlie  15                  Nagoya Batam             Batam
4  4   Delta  10     Bintaro Tangerang Selatan Tangerang Selatan
5  5   Emily   5 Tendean Jakarta Selatan 11750           Jakarta

Using a vector of known cities will allow you to as many cities as you want for the loop. Just be wary if you have two cities that might share parts of the same city name, e.g. "Big City" and "New Big City".

If your City names in the source dataframe or the list of cities have different lower & upper cases, you'll want to get that fixed first (e.g. tolower()from base R or str_to_title() from the stringr package).

The solution above also requires the city names in the dataframe are spelled correctly. If you have spelling errors, e.g. Jakerta instead of Jakarta, then a more complex solution is needed.

(edited to include mention ability to have a long list of known city names)

Shawn Janzen
  • 369
  • 3
  • 15
  • 1
    Hi Shawn, thank you for your response, i have tested your code for some row of df and it works very well ! First i use world.cities and filter it to only one Country appear, then i unlist the world.cities$name, placed it on your For-Loop code and voila! i will make sure that it performed well on overall data. From now, thank you very much ! :D – rgoei Apr 15 '21 at 08:53
  • May i ask you the meaning/ purpose of [i] in city_list[i] ? – rgoei Apr 21 '21 at 03:47
  • Sorry for the late reply--just catching your follow-up question now. i in the for loop refers to index, which will then take the first value numerical in the specified loop and count upward on each pass through the loop. In this example, the specified loop is 1:length(city_list), which in this case will be 1:5 and i will be the index value of that loop. E.g., the first pass i=1 for each instance of i such that: d$City_fix <- ifelse(grepl(city_list[1], d$City), city_list[1], d$City_fix) which makes city_list[1] equal the first value of that object or "Jakarta", and so on for each loop. – Shawn Janzen Jun 03 '21 at 17:54