1

My task is to extract specific words (the first word of the species name) from titles of journal articles. Here is a reproducible version of my dataset:

df <- data.frame(article_title = c("I like chickens and how to find chickens",
                     "A Horse hootio is going to the rainbow",
                     "A Cat caticus is eating cheese", 
                     "A Dog dogigo runs over a car",
                     "A Hippa potamus is in the sauna", # contains mispelling 
                     "Mos musculus found on a boat", # contains mispelling 
                     "A sentence not related to animals"))

The key words I want to extract are the following (with regex boundary wrappers):

words_to_match <- c('\\bchicken\\b', '\\bhorse\\b', '\\bcat\\b', 
                    '\\bdog\\b',
                    '\\bhippo\\b', # hippo
                    '\\bmus\\b',  # mus
                    '\\banimals\\b') 

The problem is when I run this:

df %>% 
  dplyr::mutate(matched_word = stringr::str_extract_all(string = article_title, 
                                                        pattern = regex(paste(words_to_match, collapse = '|'), ignore_case = TRUE)))

Problem: some titles contain mispellings that are not detected.

                           article_title matched_word
1            Chicken chook finds a pearl      Chicken
2 A Horse hootio is going to the rainbow        Horse
3         A Cat caticus is eating cheese          Cat
4           A Dog dogigo runs over a car          Dog
5        A Hippa potamus is in the sauna             
6           Mos musculus found on a boat             
7      A sentence not related to animals      animals

What I want to be able to do is find a way to make another column that tells me if there is a possible match with my any words_to_match and perhaps the % match (Levenshtein distance).

Perhaps something like this:

                           article_title matched_word %
1            Chicken chook finds a pearl      Chicken 100
2 A Horse hootio is going to the rainbow        Horse 100
3         A Cat caticus is eating cheese          Cat 100
4           A Dog dogigo runs over a car          Dog 100
5        A Hippa potamus is in the sauna        Hippo XX   
6           Mos musculus found on a boat          Mus XX    
7      A sentence not related to animals      animals 100

Any suggestion would be appreciated even if it is not using R

PanOle
  • 65
  • 6

2 Answers2

3

You can use adist to find approximately matches:

x <- adist(words_to_match, df$article_title, fixed=FALSE, ignore.case = TRUE)
i <- apply(x, 1, which.min)
df$matched_word <- words_to_match[i]
df$adist <- mapply("[", asplit(x, 2), i)
df
#                             article_title  matched_word adist
#1 I like chickens and how to find chickens \\bchicken\\b     2
#2   A Horse hootio is going to the rainbow   \\bhorse\\b     0
#3           A Cat caticus is eating cheese     \\bcat\\b     0
#4             A Dog dogigo runs over a car     \\bdog\\b     0
#5          A Hippa potamus is in the sauna   \\bhippo\\b     1
#6             Mos musculus found on a boat     \\bmus\\b     1
#7        A sentence not related to animals \\banimals\\b     0
GKi
  • 37,245
  • 2
  • 26
  • 48
  • Hey @GKi, thanks for the suggestion. This is definitely going the right track. One issue I have is in my real dataset, my `words_to_match` vector is much larger than than the length of `df$article`. Is there a way to work around this issue? – PanOle Jul 23 '20 at 03:23
  • This should not cause a problem when your `words_to_match` vector is much larger than than the length of `df$article_title`. What error do you get? – GKi Aug 17 '20 at 06:37
2

You could put the words plain into a vector wm and strsplit each sentence. Then in an lapply use adist to get a distance matrix of each word to each element wm. The minimum should give you the best match. I'm not sure about your rationale of levenshtein distance (LD) in percents, though.

wm <- c("chicken", "horse", "cat", "dog", "hippo", "mus", "animals")

dl <- strsplit(df$article_title, " ")

res <- do.call(rbind, lapply(dl, function(x) {
  e  <- adist(tolower(x), wm)
  mins <- apply(e, 2, min)
  emin <- which.min(mins)
  data.frame(matched_word=wm[emin], LD=mins[emin])
}))
res
#   matched_word LD
# 1      chicken  1
# 2        horse  0
# 3          cat  0
# 4          dog  0
# 5        hippo  1
# 6          mus  1
# 7      animals  0

Data:

df <- structure(list(article_title = c("I like chickens and how to find chickens", 
"A Horse hootio is going to the rainbow", "A Cat caticus is eating cheese", 
"A Dog dogigo runs over a car", "A Hippa potamus is in the sauna", 
"Mos musculus found on a boat", "A sentence not related to animals"
)), class = "data.frame", row.names = c(NA, -7L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110