regex/stringr: splitting a joined sequence of country names

Question

I have a string that contains several country names joined together. The only thing separating the names is that a capital letter follows a lowercase letter without a space in between (spaces are, however, part of some country names, e.g. Democratic Republic of the Congo).

My stringr/regex attempt is rather close, but I am losing the first letter of the second and subsequent country names. Any help? Many thanks.

library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> Warning: package 'forcats' was built under R version 3.6.3
v <- structure(list(countries = c("Democratic Republic of the CongoSweden", 
                             "DenmarkIran (Islamic Republic of)", "AfghanistanSweden", "AzerbaijanSwedenGermany", 
                             "BangladeshSweden", "DenmarkSri Lanka", "CanadaSri Lanka", "DenmarkNigeria", 
                             "CanadaIreland", "CanadaMexico")), class = c("tbl_df", "tbl", 
                                                                          "data.frame"), row.names = c(NA, -10L))



v %>% 
  mutate(index=row_number()) %>% 
  #mutate(countries_split=str_split(countries, "[A-Z][a-z]*[a-z:space:]+(?=[A-Z])")) %>%
  #mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+[A-Z][a-z]{1,20}+).")) %>% 
  mutate(countries_split=str_split(countries, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)[A-Z]")) %>% 
  unnest(countries_split)
#> # A tibble: 21 x 3
#>    countries                              index countries_split                 
#>    <chr>                                  <int> <chr>                           
#>  1 Democratic Republic of the CongoSweden     1 Democratic Republic of the Congo
#>  2 Democratic Republic of the CongoSweden     1 weden                           
#>  3 DenmarkIran (Islamic Republic of)          2 Denmark                         
#>  4 DenmarkIran (Islamic Republic of)          2 ran (Islamic Republic of)       
#>  5 AfghanistanSweden                          3 Afghanistan                     
#>  6 AfghanistanSweden                          3 weden                           
#>  7 AzerbaijanSwedenGermany                    4 Azerbaijan                      
#>  8 AzerbaijanSwedenGermany                    4 weden                           
#>  9 AzerbaijanSwedenGermany                    4 ermany                          
#> 10 BangladeshSweden                           5 Bangladesh                      
#> # ... with 11 more rows

Created on 2020-03-06 by the reprex package (v0.3.0)

Why not a simple and [well-known `"(?<=[a-z])(?=[A-Z])"` regex](https://stackoverflow.com/questions/43706474/splitting-string-between-capital-and-lowercase-character-in-r)? — Wiktor Stribiżew, Mar 06 '20 at 10:00
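The simpler pattern suggested in the comment above can be tried on a few of the question's strings. This is only a sketch: `v_demo` and `res` are names made up for the illustration, and it assumes that every boundary between two names is a lowercase letter directly followed by a capital letter, as described in the question.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical subset of the question's data, rebuilt so the sketch is self-contained
v_demo <- tibble(countries = c("Democratic Republic of the CongoSweden",
                               "DenmarkIran (Islamic Republic of)",
                               "AzerbaijanSwedenGermany",
                               "DenmarkSri Lanka"))

res <- v_demo %>%
  mutate(index = row_number(),
         # Split at the empty position between a lowercase and an uppercase letter;
         # both lookarounds are zero-width, so no character is removed
         countries_split = str_split(countries, "(?<=[a-z])(?=[A-Z])")) %>%
  unnest(countries_split)

res$countries_split
```

Because a space or an opening parenthesis sits in front of the capitals inside names such as "Sri Lanka" or "Iran (Islamic Republic of)", those names stay intact. A name with a lowercase-to-uppercase transition inside a single word would be split incorrectly, so the pattern is only as good as the assumption stated above.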

score 3 · Accepted Answer · answered Mar 06 '20 at 09:56

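We can wrap the capital letter in a positive lookahead, `(?=[A-Z])`. A lookahead only asserts that the capital letter follows; it does not consume it, so the split happens at an empty position and the letter stays at the start of the next country name.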

library(tidyverse)

v %>%
  mutate(row = row_number(), 
         countries = str_split(countries, 
                   "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)(?=[A-Z])")) %>%
  unnest(countries)

# A tibble: 21 x 2
#   countries                          row
#   <chr>                            <int>
# 1 Democratic Republic of the Congo     1
# 2 Sweden                               1
# 3 Denmark                              2
# 4 Iran (Islamic Republic of)           2
# 5 Afghanistan                          3
# 6 Sweden                               3
# 7 Azerbaijan                           4
# 8 Sweden                               4
# 9 Germany                              4
#10 Bangladesh                           5
# … with 11 more rows
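To see the difference in isolation, the question's pattern and the answer's pattern can be run side by side on one string. This is a minimal sketch; `x`, `consumed` and `kept` are names chosen for the illustration.

```r
library(stringr)

x <- "AzerbaijanSwedenGermany"

# Question's pattern: the trailing [A-Z] is part of the match,
# so str_split() treats the capital letter as the separator and drops it
consumed <- str_split(x, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)[A-Z]")[[1]]

# Answer's pattern: the capital letter sits inside a lookahead,
# so the match is empty and nothing is dropped
kept <- str_split(x, "(?<=[A-Z][a-z]{0,20}+[a-z:space:]{0,20}+)(?=[A-Z])")[[1]]

print(consumed)  # "Azerbaijan" "weden" "ermany"
print(kept)      # "Azerbaijan" "Sweden" "Germany"

# Side note: [a-z:space:] is not a POSIX class; it is a set of literal characters.
# A POSIX class needs its own brackets inside the set: [a-z[:space:]]
str_detect(" ", "[a-z:space:]")    # FALSE
str_detect(" ", "[a-z[:space:]]")  # TRUE
```

The side note matters here: because `[a-z:space:]` does not match a space, the lookbehind fails in front of the capitals in "Sri Lanka" or "Republic of the Congo", which is why those names are not split. Changing it to `[a-z[:space:]]` would most likely start splitting inside such names, so the pattern should be retested before making that change.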
