My task is to extract specific words (the first word of the species name) from titles of journal articles. Here is a reproducible version of my dataset:
df <- data.frame(article_title = c("Chicken chook finds a pearl",
"A Horse hootio is going to the rainbow",
"A Cat caticus is eating cheese",
"A Dog dogigo runs over a car",
"A Hippa potamus is in the sauna", # contains misspelling
"Mos musculus found on a boat", # contains misspelling
"A sentence not related to animals"))
The keywords I want to extract are the following (wrapped in regex word boundaries):
words_to_match <- c('\\bchicken\\b', '\\bhorse\\b', '\\bcat\\b',
'\\bdog\\b',
'\\bhippo\\b', # hippo
'\\bmus\\b', # mus
'\\banimals\\b')
Here is what I run:
library(dplyr) # provides %>% and mutate()
library(stringr) # provides str_extract_all() and regex()
df %>%
dplyr::mutate(matched_word = stringr::str_extract_all(string = article_title,
pattern = stringr::regex(paste(words_to_match, collapse = '|'), ignore_case = TRUE)))
Problem: titles that contain misspellings are not matched:
article_title matched_word
1 Chicken chook finds a pearl Chicken
2 A Horse hootio is going to the rainbow Horse
3 A Cat caticus is eating cheese Cat
4 A Dog dogigo runs over a car Dog
5 A Hippa potamus is in the sauna
6 Mos musculus found on a boat
7 A sentence not related to animals animals
What I want is another column that tells me whether a title contains a possible (fuzzy) match with any of my words_to_match,
and perhaps the % match (e.g. based on Levenshtein distance).
Perhaps something like this:
article_title matched_word %
1 Chicken chook finds a pearl Chicken 100
2 A Horse hootio is going to the rainbow Horse 100
3 A Cat caticus is eating cheese Cat 100
4 A Dog dogigo runs over a car Dog 100
5 A Hippa potamus is in the sauna Hippo XX
6 Mos musculus found on a boat Mus XX
7 A sentence not related to animals animals 100
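To make the idea concrete, here is a rough sketch of the kind of thing I have in mind, using base R's `utils::adist()` (generalized Levenshtein distance). The helper name `best_fuzzy_match`, the plain `keywords` vector (my keywords without the `\\b` wrappers), and the "% match" formula (1 minus distance divided by the length of the longer string) are all my own assumptions, not an established method:

```r
# Keywords without the regex boundary wrappers (assumed plain lower-case forms)
keywords <- c("chicken", "horse", "cat", "dog", "hippo", "mus", "animals")

# Hypothetical helper: find the keyword that is closest to any word of a title
best_fuzzy_match <- function(title, keywords) {
  # Split the title into lower-case words, dropping empty pieces
  words <- tolower(unlist(strsplit(title, "[^[:alpha:]]+")))
  words <- words[nzchar(words)]

  # Levenshtein distance matrix: rows = title words, columns = keywords
  d <- utils::adist(words, keywords)

  # Assumed similarity: 1 - distance / length of the longer string
  max_len <- outer(nchar(words), nchar(keywords), pmax)
  sim <- 1 - d / max_len

  # Position (row, column) of the highest similarity; first one wins on ties
  idx <- arrayInd(which.max(sim), dim(sim))

  data.frame(matched_word = keywords[idx[1, 2]],
             pct_match = round(100 * sim[idx]),
             stringsAsFactors = FALSE)
}

titles <- c("A Cat caticus is eating cheese",
            "A Hippa potamus is in the sauna",
            "Mos musculus found on a boat")

# One row per title, bound to the original titles
result <- do.call(rbind, lapply(titles, best_fuzzy_match, keywords = keywords))
cbind(article_title = titles, result)
```

One caveat with this sketch: every title gets some "best" keyword, even an unrelated one, so I would presumably need a cutoff on `pct_match` (the right value is unclear to me) below which the match is discarded.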
Any suggestions would be appreciated, even if they do not use R.