Efficient custom stemming in R package tm

Question

The code below implements custom stemming, or grouping, of words. It (seems to) work, but I'm looking for a way to shorten it.

# Simple reproducible example:
library(tm)
vec <- c("partners, very good", "partnery SOso Goodish!", "partna goodies", 
         "Good night")
corp <- Corpus(VectorSource(vec))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# Custom stemming (how can I shorten this code and avoid repetition?)
corp <- tm_map(
  corp, 
  content_transformer(gsub), 
  pattern = "good[^ ]*", 
  replacement = "good"
)
corp <- tm_map(
  corp, 
  content_transformer(gsub), 
  pattern = "partn[^ ]*", 
  replacement = "partn"
)

Background: I can't use standard stemming methods because:

- No stemming algorithms have been developed for the language of the text I'm analysing.
- I'm using the code to group terms that are closely related in meaning (but not in spelling), so that I can later feed the data to clustering algorithms.
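
To illustrate the second point, here is a minimal sketch of grouping by meaning rather than by spelling, using an exact-match lookup table in base R. The words in `groups` and the helper name `group_terms` are made up for illustration and are not part of tm or of my real data.

```r
# Hypothetical lookup table: names are surface forms, values are group labels
groups <- c(excellent = "good", fine = "good", associate = "partn")

# Hypothetical helper: replace whole tokens that appear in the lookup table
group_terms <- function(x, lookup) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    hit <- tokens %in% names(lookup)          # which tokens have a group label
    tokens[hit] <- lookup[tokens[hit]]        # swap them for that label
    paste(tokens, collapse = " ")             # rebuild the document string
  }, character(1), USE.NAMES = FALSE)
}

group_terms(c("excellent associate", "fine night"), groups)
# [1] "good partn" "good night"
```

Unlike the regex approach, this only replaces whole tokens, so unrelated words that merely share a prefix are left alone.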

EDIT

I've reached a somewhat satisfactory and more scalable solution but I still have a feeling this is not the way this should be done...

# Make a list of pattern/replacement pairs
steml <- list(
  c("good[^ ]*", "good"),
  c("partn[^ ]*", "partn")
)

for (pair in steml) {
  corp <- tm_map(
    corp, 
    content_transformer(gsub), 
    pattern     = pair[1],
    replacement = pair[2]
  )
}
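
One possible way to avoid calling `tm_map` once per rule is to fold all the rules into a single function and wrap that function once. The sketch below assumes the rules are stored as a named character vector (names are patterns, values are stems); `stem_custom` and `rules` are hypothetical names, not tm functions. The tm part only runs if the package is installed.

```r
# Hypothetical helper (not part of tm): apply every pattern/replacement
# rule in turn to a character vector.
stem_custom <- function(x, rules) {
  # rules: named character vector, names = regex patterns, values = stems
  for (i in seq_along(rules)) {
    x <- gsub(names(rules)[i], rules[[i]], x)
  }
  x
}

rules <- c("good[^ ]*" = "good", "partn[^ ]*" = "partn")

stem_custom(c("partners very good", "partna goodies"), rules)
# [1] "partn very good" "partn good"

# With tm, the helper is wrapped once; extra arguments to tm_map are
# passed on to the transformer, as with pattern/replacement above.
if (requireNamespace("tm", quietly = TRUE)) {
  corp2 <- tm::VCorpus(tm::VectorSource(c("partners very good", "partna goodies")))
  corp2 <- tm::tm_map(corp2, tm::content_transformer(stem_custom), rules = rules)
}
```

The rules are applied in order, so a later pattern sees the output of an earlier one. That matters if two patterns can match the same text.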

Might be worth looking at [this answer](https://stackoverflow.com/questions/46731429/quanteda-fastest-way-to-replace-tokens-with-lemma-from-dictionary?s=1|100.9600) and [this one](https://stackoverflow.com/questions/46731429/quanteda-fastest-way-to-replace-tokens-with-lemma-from-dictionary/46742533?s=2|28.2251#46742533). — Ken Benoit, Feb 14 '18 at 21:18
I am definitely suggesting you give **quanteda** a try - it has more direct functionality for what you need than **tm**. — Ken Benoit, Feb 15 '18 at 09:26
