My goal is to identify whether a given text contains a target string, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis).

Example:

target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

Desired Output:

I would like to have target strlng as the output, since it is very close to the target (Levenshtein distance of 1). Next, I want to use target strlng to extract the word Butter (this part I have covered; I only mention it to give a detailed spec).

What i tried:

Using adist did not work, since it compares two strings, not substrings.

Next I took a look at agrep, which seems very close. I can get the output that my target was found, but not the substring that "caused" the match.

I tried with value = TRUE, but it seems to work at the array level: it returns whole elements of x, not the matching part. I don't think I can switch to an array type, because I cannot split by spaces (my target string might contain spaces, ...).

agrep(
  pattern = target, 
  x = text,
  value = TRUE
)
oguz ismail
Tlatwork

1 Answer

Use aregexec. It is used in the same way as regexpr/regmatches (or gregexpr) for extracting exact matches.

m <- aregexec('string', 'text strlng wrong')
regmatches('text strlng wrong', m)
#[[1]]
#[1] "strlng"
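
How many edits are tolerated is controlled by the max.distance argument of aregexec (default 0.1, read as a fraction of the pattern length). As a minimal sketch, reusing the target and text from the question, the following compares the default tolerance with max.distance = 0, which should behave like exact matching. The names m_default and m_exact are only illustrative.

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

# Default tolerance: max.distance = 0.1, a fraction of the pattern length.
m_default <- aregexec(target, text)
regmatches(text, m_default)

# No edits allowed: the typo "strlng" should no longer match,
# so aregexec reports -1 and regmatches returns an empty character vector.
m_exact <- aregexec(target, text, max.distance = 0)
regmatches(text, m_exact)
```

Checking for an empty result (or for a start position of -1) is a simple way to tell "no approximate match" apart from a real hit before doing further text analysis.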

This can be wrapped in a function that accepts the arguments of both aregexec and regmatches. Note that invert, the regmatches argument, comes after the dots argument ... in the wrapper's signature, so it must be passed by name.

aregextract <- function(pattern, text, ..., invert = FALSE){
  m <- aregexec(pattern, text, ...)
  regmatches(text, m, invert = invert)
}

aregextract(target, text)
#[[1]]
#[1] "target strlng"

aregextract(target, text, invert = TRUE)
#[[1]]
#[1] "the "                                       
#[2] ": Butter. this text i dont want to extract."
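
If the position of the match is needed rather than the pieces around it, the object returned by aregexec also carries the start position and a "match.length" attribute. The sketch below is one possible way to use them for the follow-up step from the question (getting the word after the match). The variable names matched, rest and next_word are hypothetical, and the regular expression assumes the wanted word is the first run of word characters after the match.

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

m <- aregexec(target, text)
start <- m[[1]][1]                        # start of the match, or -1 if none
len <- attr(m[[1]], "match.length")[1]    # number of matched characters

matched <- NA_character_
next_word <- NA_character_

if (start != -1) {
  # The substring that "caused" the match.
  matched <- substring(text, start, start + len - 1)
  # Everything after the match.
  rest <- substring(text, start + len)
  # Skip non-word characters, then keep the first word.
  next_word <- sub("^\\W*(\\w+).*$", "\\1", rest)
}

matched
next_word
```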
Rui Barradas