This code seems to work, but I'm looking for a way to shorten it. It implements custom stemming, i.e. grouping of words.
# Simple reproducible example:
library(tm)
vec <- c("partners, very good", "partnery SOso Goodish!", "partna goodies",
"Good night")
corp <- Corpus(VectorSource(vec))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
# Custom stemming (how can I shorten this code and avoid repetition?)
corp <- tm_map(
corp,
content_transformer(gsub),
pattern = "good[^ ]*",
replacement = "good"
)
corp <- tm_map(
corp,
content_transformer(gsub),
pattern = "partn[^ ]*",
replacement = "partn"
)
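One way the two calls above might be collapsed, sketched in base R only so it runs without tm: since each replacement here is just the fixed prefix of its own pattern, the prefixes can be joined into one alternation and put back with a backreference. The names `stems`, `pattern` and `collapsed` are made up for this illustration, and the trick only applies while every replacement equals its matched prefix.

```r
# Assumption: every replacement is identical to the literal prefix it matches.
vec <- c("partners very good", "partnery soso goodish",
         "partna goodies", "good night")

# Hypothetical list of prefixes to collapse words onto.
stems <- c("good", "partn")

# Build "(good|partn)[^ ]*": a known prefix followed by any non-space characters.
pattern <- paste0("(", paste(stems, collapse = "|"), ")[^ ]*")

# "\\1" puts back only the captured prefix, dropping the rest of the word.
collapsed <- gsub(pattern, "\\1", vec)
print(collapsed)
```

If that holds, the same pattern should plug into a single call of the form already used above, `tm_map(corp, content_transformer(gsub), pattern = pattern, replacement = "\\1")`, though I have only checked the base R part.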
Background: I can't use standard stemming methods since:
- No stemming algorithms for the language of the text I'm analysing have been developed.
- I'm using the code to group terms that are closely related in meaning (but not spelling) for later feeding the data to clustering algorithms.
EDIT
I've reached a somewhat satisfactory and more scalable solution but I still have a feeling this is not the way this should be done...
# Make a list of pattern/replacement pairs
steml <- list(
c("good[^ ]*", "good"),
c("partn[^ ]*", "partn")
)
# Iterate over the pairs themselves rather than over their indices
for (pair in steml) {
corp <- tm_map(
corp,
content_transformer(gsub),
pattern = pair[1],
replacement = pair[2]
)
}
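A possible alternative to calling `tm_map` once per pair is to wrap all the substitutions in one function and apply that function once. The sketch below uses base R only; `stem_custom` and `stems` are hypothetical names, and the named-vector layout (names are patterns, values are replacements) is just one assumed way to store the pairs.

```r
# Hypothetical helper: apply every pattern/replacement pair to a character vector.
# `stems` is a named character vector: names are regex patterns, values replacements.
stem_custom <- function(x, stems) {
  for (i in seq_along(stems)) {
    x <- gsub(names(stems)[i], stems[[i]], x)
  }
  x
}

stems <- c("good[^ ]*" = "good", "partn[^ ]*" = "partn")

vec <- c("partners very good", "partnery soso goodish",
         "partna goodies", "good night")
stemmed <- stem_custom(vec, stems)
print(stemmed)
```

Unlike a single combined regex, this keeps arbitrary replacements possible, so terms related in meaning but not in spelling can be mapped to a common label. With tm loaded it should be usable as `corp <- tm_map(corp, content_transformer(stem_custom), stems = stems)`, in the same way the extra arguments are passed to `gsub` above; that call is untested here.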