I have a corpus of a couple of thousand documents and I'm trying to find the most commonly mentioned countries in the abstracts.

The library countrycode seems to have a comprehensive list of country names I can match against:

# country.name.alt shows multiple potential namings for 'Congo' (yay!):
install.packages("countrycode")
library(dplyr)  # provides filter()
countrycode::countryname_dict |> filter(grepl('congo', tolower(country.name.alt)))
# Also seems to work for ones like "China"/"People's Republic of China"

A reprex of the data looks something like this:

df <- data.frame(entry_number = 1:5,
                 text = c("a few paragraphs that might contain the country name congo or democratic republic of congo",
                          "More text that might contain myanmar or burma, as well as thailand",
                          "sentences that do not contain a country name can be returned as NA",
                          "some variant of U.S or the united states",
                          "something with an accent samóoa"))

I want to reduce each entry in the column "text" to contain only a country name. Ideally something like this (note the repeat entry number):

desired_df <- data.frame(entry_number = c(1, 2, 2, 3, 4, 5),
                     text = c("congo",
                              "myanmar",
                              "thailand",
                              NA,
                              "united states",
                              "samoa"))

I've tried str_extract and various other approaches without success! The corpus is in English, but the international alphabets included in countrycode::countryname_dict$country.name.alt throw regex errors. countrycode::countryname_dict$country.name.alt contains all the alternatives that countrycode::countryname_dict$country.name.en does not...

Open to any approach (dplyr, data.table...) that answers the initial question of how many times each country is mentioned in the corpus. The only requirement is that it is as robust as possible to different potential country names, accents and any other hidden catches!

Thanks community!

P.S. I have reviewed the following questions but had no luck with my own example:

QAsena

1 Answer

This seems to work well on the example data.

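library(tidyverse)

# Keep only the alternative names that contain Latin letters, lower-cased
all_country <- countrycode::countryname_dict %>% 
                  filter(grepl('[A-Za-z]', country.name.alt)) %>%
                  pull(country.name.alt) %>% 
                  tolower()
# Build one alternation pattern that covers every name
pattern <- str_c(all_country, collapse = '|')

# Extract every match per entry, then expand to one row per match
# (keep_empty = TRUE keeps entries without a match as NA)
df %>%
  mutate(country = str_extract_all(tolower(text), pattern)) %>%
  select(-text) %>%
  unnest(country, keep_empty = TRUE)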

#  entry_number country                     
#         <int> <chr>                       
#1            1 congo                       
#2            1 democratic republic of congo
#3            2 myanma                      
#4            2 burma                       
#5            2 thailand                    
#6            3 NA                          
#7            4 united states               
#8            5 samóoa                 
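
A possible refinement, offered as a sketch rather than as part of the answer tested above: since the question asks for robustness to accents and regex catches, the text could be transliterated to ASCII before matching, the names could be escaped so that characters like `.` are taken literally, and word boundaries could stop a name from matching inside a longer word. The vector `alt_names` below is a small hypothetical stand-in for the lower-cased `country.name.alt` column, and `clean_text()` is a helper name introduced here for illustration.

```r
library(stringr)
library(stringi)

# Hypothetical stand-in for tolower(countrycode::countryname_dict$country.name.alt)
alt_names <- c("congo", "democratic republic of congo", "myanmar", "burma",
               "thailand", "united states", "u.s", "samoa")

# Try longer names first, so a long name is attempted before a shorter one it starts with
alt_names <- alt_names[order(nchar(alt_names), decreasing = TRUE)]

# Escape regex metacharacters so that e.g. the dot in "u.s" is matched literally
escaped <- str_replace_all(alt_names, "([.^$\\\\|*+?{}\\[\\]()])", "\\\\\\1")

# Require a word boundary on both sides of whichever name matches
pattern <- str_c("\\b(?:", str_c(escaped, collapse = "|"), ")\\b")

# Illustrative helper: lower-case the text and strip accents (e.g. "ó" becomes "o")
clean_text <- function(x) stri_trans_general(tolower(x), "Latin-ASCII")

# "Sam\u00f3a" is "Samóa" written with a Unicode escape
str_extract_all(clean_text("Burma, Thailand and Sam\u00f3a"), pattern)[[1]]
str_extract_all(clean_text("some variant of U.S or the united states"), pattern)[[1]]
```

Under these assumptions, `pattern` and `clean_text()` could be swapped into the `mutate()` call above. Note that the sample string `samóoa` from the question would become `samooa` after transliteration, so it would still not match `samoa`; only the accent is handled, not the extra letter.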
Ronak Shah
  • An excellent solution (as always ;)). I wonder why `separate_rows(country, sep = ",\\s")` does not work here (instead of `unnest(country, keep_empty = TRUE)`)? – Chris Ruehlemann Oct 07 '21 at 06:36
  • `separate_rows` works with string values, whereas `str_extract_all` gives us list values (although both of them look the same). – Ronak Shah Oct 07 '21 at 06:52
  • Thanks very much @Ronak Shah (thought this might be one for you!)! I might suggest the following to match names back to the more generic `country.name.en` so that separate entries like myanmar and burma are both matched to myanmar (burma): `country_match <- countrycode::countryname_dict |> mutate(across(everything(), tolower))` --- `res <- left_join(res, country_match, by = c("country" = "country.name.alt"))`. It takes about 10-15 minutes to run on my full dataset (which is fine, I'll get a tea!). I was wondering if `multidplyr` would help? – QAsena Oct 08 '21 at 02:23
  • It should but unfortunately, I don't have enough experience with it. – Ronak Shah Oct 08 '21 at 02:45
  • Me neither; I'm taking a look at the package now. It might rely on grouping to split processes. If I get the chance to dig into it more I'll suggest an addition for going parallel. Thanks very much for the help, so far as I can tell at the moment it is working on the full data :) – QAsena Oct 08 '21 at 02:51
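
The join suggested in the comments can be extended to answer the original question of how often each country is mentioned. The following is a minimal sketch under stated assumptions: `res` and `country_match` are tiny hypothetical stand-ins for the extraction result and for the lower-cased `countrycode::countryname_dict`, and the column name `n_documents` is introduced here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the extraction result (one row per match, NA if none)
res <- data.frame(
  entry_number = c(1, 2, 2, 2, 3, 4),
  country      = c("congo", "myanmar", "burma", "thailand", NA, "united states"))

# Hypothetical stand-in for the lower-cased dictionary used in the comment above
country_match <- data.frame(
  country.name.en  = c("congo - brazzaville", "myanmar (burma)", "myanmar (burma)",
                       "thailand", "united states"),
  country.name.alt = c("congo", "myanmar", "burma", "thailand", "united states"))

counts <- res %>%
  filter(!is.na(country)) %>%                        # drop entries with no match
  left_join(country_match,
            by = c("country" = "country.name.alt")) %>%
  distinct(entry_number, country.name.en) %>%        # one row per document and country
  count(country.name.en, sort = TRUE, name = "n_documents")

counts
```

The `distinct()` step is a design choice: it counts a document once per country, so an abstract mentioning both myanmar and burma contributes one mention of myanmar (burma). Dropping that step would count every individual match instead.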