qdap::mgsub takes the following main parameters:

mgsub(pattern, replacement, text.var)

Within a tm corpus transformation, you can wrap non-tm functions in content_transformer(), e.g.

corpus <- tm_map(corpus, content_transformer(tolower))

Here is a data frame with some poorly spelt text:

df <- data.frame(
  id = 1:2,
  sometext = c("[cad] appls", "bannanas")
)

And here is a data frame with a custom lookup for misspelt words:

spldoc <- data.frame(
  incorrects = c("appls", "bannanas"),
  corrects = c("apples", "bananas")
)

Outside the context of a corpus and content_transformer(), I could just use mgsub like this:

library(dplyr) # select(), %>%
library(qdap)  # mgsub()
# wrap each misspelling in \\b so that only whole words match
wrongs <- select(spldoc, incorrects)[, 1] %>% paste0("\\b", ., "\\b")
rights <- select(spldoc, corrects)[, 1]
df$sometext <- mgsub(wrongs, rights, df$sometext, fixed = FALSE)

But I can't see how to wrap mgsub in a function that I can pass to content_transformer(). What would I supply as the text argument to mgsub?


1 Answer

This is what I did:

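# Wrap mgsub in a content_transformer so it can be passed to tm_map().
# x is the document text (supplied by tm); lut is the lookup data frame.
spelling_update <- content_transformer(function(x, lut) {
  mgsub(paste0("\\b", lut[, 1], "\\b"), lut[, 2], x, fixed = FALSE)
})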

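Then pass the lookup table through tm_map() as an extra argument. tm_map() forwards extra arguments to the transformation, and x is filled in with each document's content:

corpus <- tm_map(corpus, spelling_update, lut = spldoc)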
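
To sanity-check the lookup logic without a corpus, the same whole-word replacement can be approximated in base R. This is only a sketch: fix_spelling() and the small lut below are hypothetical names made up for illustration, and a plain gsub() loop stands in for qdap::mgsub, so details such as mgsub's whitespace trimming are not reproduced.

```r
# Hypothetical helper (not part of tm or qdap): apply each lookup row in turn.
fix_spelling <- function(x, lut) {
  # Wrap each misspelling in \\b so that only whole words are replaced
  patterns <- paste0("\\b", lut[[1]], "\\b")
  replacements <- as.character(lut[[2]])
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x, perl = TRUE)
  }
  x
}

# Illustrative lookup table: column 1 = misspelling, column 2 = correction
lut <- data.frame(
  incorrects = c("appls", "bannanas"),
  corrects = c("apples", "bananas"),
  stringsAsFactors = FALSE
)

result <- fix_spelling(c("[cad] appls", "bannanas", "pineappls"), lut)
print(result)
# "[cad] apples" "bananas" "pineappls"  ("pineappls" is untouched: no word boundary)
```

If this behaves as expected on plain strings, the same function could be handed to tm with tm_map(corpus, content_transformer(fix_spelling), lut = lut).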