I am using tokens_lookup to check whether some texts contain the words in my dictionary, discarding matches that are part of a longer pattern of words, with nested_scope = "dictionary", as described in this answer. The idea is to discard longer dictionary matches which contain a nested target word (e.g. include Ireland but not Northern Ireland).
Now I'd like to:
(1) Create a dummy variable indicating whether the text contains the words in the dictionary. I managed to do it with the code below, but I don't understand why I have to write IE in lowercase in as.logical.
library(quanteda)
library(dplyr)  # for %>%

df <- structure(list(num = c(2345, 3564, 3636),
                     text = c("Ireland lorem ipsum",
                              "Lorem ipsum Northern Ireland",
                              "Ireland lorem ipsum Northern Ireland")),
                row.names = c(NA, -3L),
                class = c("tbl_df", "tbl", "data.frame"))
dict <- dictionary(list(IE = "Ireland", "Northern Ireland" = "Northern Ireland"),
                   tolower = FALSE)
corpus <- corpus(df, text_field = "text")
toks <- tokens(corpus)
# Replace matches with their dictionary keys, then drop the "Northern Ireland" key
dfm <- tokens_lookup(toks, dictionary = dict, nested_scope = "dictionary",
                     case_insensitive = FALSE) %>%
  tokens_remove("Northern Ireland") %>%
  dfm()
df$contains <- as.logical(dfm[, "ie"], case_insensitive = FALSE)
(2) Store the matches in another column by using kwic. Is there a way to exclude a dictionary key in kwic (Northern Ireland in the example)? In my attempt I get a keyword column that contains both Ireland and Northern Ireland matches. (I don't know if it makes any difference, but in my full dataset I have multiple matches per row.) Thank you.
library(tidyr)  # for pivot_wider()
words <- kwic(toks, pattern = dict, case_insensitive = FALSE)
df$docname <- docnames(dfm)
# Keep docname alongside keyword so the merge key exists
df_keywords <- merge(df, as.data.frame(words)[, c("docname", "keyword")],
                     by = "docname", all.x = TRUE)
df_keywords <- df_keywords %>%
  group_by(docname, num) %>%
  mutate(n = row_number()) %>%
  pivot_wider(id_cols = c(docname, num, text, contains),
              values_from = keyword, names_from = n, names_prefix = "keyword")