Misspelling-aware stemming with R Text Analysis

Question

I am new to the `tm` package in R. I am trying to perform a word frequency analysis, but I know my source file contains several spelling errors. How can I fix these errors before computing word frequencies?

I have already read another post (Stemming with R Text Analysis), but I have a question about the solution proposed there: is it possible to use a dictionary (a data frame, for example) to make several or all of the replacements in my corpus before creating the TermDocumentMatrix and running the word frequency analysis?

I have a data frame with the dictionary, and it has the following structure:

sept    -> september  
sep     -> september  
acct    -> account  
serv    -> service  
servic  -> service  
adj     -> adjustment  
ajuste  -> adjustment

I know I could write a function to perform transformations on my corpus, but I do not know how to automate this task, for example by looping over each record in my data frame.

Any help would be greatly appreciated.

It depends totally on what your corpus is - what language, what acronyms, what domain-specific terms etc. But if you just want a stemmer constructed automatically from a standard English (or whatever languages) dictionary, then [Tyler Rinker's answers](http://stackoverflow.com/questions/24443388/stemming-with-r-text-analysis/24454727#24454727) show what you want. All you need to add is code for synthesizing likely misspellings, or use a word-distance metric like Levenshtein distance (see `adist`) to find the closest match in dictionary. — smci, May 27 '15 at 23:10

score 1 · Accepted Answer · edited May 23 '17 at 11:58

For the basic automatic construction of a stemmer from a standard English dictionary, [Tyler Rinker's answers](http://stackoverflow.com/questions/24443388/stemming-with-r-text-analysis/24454727#24454727) already show what you want.

All you need to add is code for synthesizing likely misspellings, or for matching (common) misspellings in your corpus using a word-distance metric like Levenshtein distance (see `adist`) to find the closest match in the dictionary.
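As a rough illustration of both ideas, here is a minimal base-R sketch. The data frame `dict` and the helpers `replace_from_dict()` and `closest_term()` are hypothetical names made up for this example (they are not part of `tm`), and the `max_dist` threshold of 1 is an arbitrary assumption you would tune for your corpus.

```r
# Hypothetical lookup table mirroring the dictionary in the question.
dict <- data.frame(
  from = c("sept", "sep", "acct", "serv", "servic", "adj", "ajuste"),
  to   = c("september", "september", "account", "service", "service",
           "adjustment", "adjustment"),
  stringsAsFactors = FALSE
)

# Replace whole whitespace-separated tokens found in dict$from;
# tokens that are not in the dictionary are left unchanged.
replace_from_dict <- function(text, dict) {
  lookup <- setNames(dict$to, dict$from)
  vapply(strsplit(text, "\\s+"), function(tokens) {
    hit <- tokens %in% names(lookup)
    tokens[hit] <- lookup[tokens[hit]]
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Fuzzy fallback: map each token to the nearest vocabulary entry by
# Levenshtein distance (adist), but only when it is within max_dist edits.
closest_term <- function(tokens, vocabulary, max_dist = 1) {
  d <- adist(tokens, vocabulary)              # rows: tokens, cols: vocabulary
  best <- max.col(-d, ties.method = "first")  # column index of the row minimum
  best_dist <- d[cbind(seq_along(tokens), best)]
  ifelse(best_dist <= max_dist, vocabulary[best], tokens)
}

replace_from_dict("acct adj sept 2015", dict)
# "account adjustment september 2015"

closest_term(c("acount", "servise", "xyz"),
             c("account", "service", "adjustment"))
# "account" "service" "xyz"
```

With a `tm` corpus, a function like this can presumably be applied before building the TermDocumentMatrix via `tm_map(corpus, content_transformer(replace_from_dict), dict = dict)`, since `tm_map` forwards extra arguments to the transformation.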

answered May 27 '15 at 23:16

smci

Actually, the corpus is in Spanish. Do you know where I can find guidance on how to construct the stemmer for Spanish? – OOP Jun 10 '15 at 23:18
Just use MrFlick's answer with `tm_map(corpus, stemDocument, language = "spanish")` – smci Jun 10 '15 at 23:27
`tm::stemDocument(... language='lang')` calls `SnowballC:::wordStem()` which has builtin stemmers for all the languages in `SnowballC:::getStemLanguages()` i.e. *"danish" "dutch" "english" "finnish" "french" "german" "hungarian" "italian" "norwegian" "porter" "portuguese" "romanian" "russian" "spanish" "swedish" "turkish"*. See the documentation for `tm` and `SnowballC` packages. – smci Jun 11 '15 at 00:19
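
For reference, the stemmer mentioned in the comments can also be called directly on a character vector. This is only a small sketch: it assumes the `SnowballC` package is installed, and the three sample words are made up for illustration.

```r
# Illustrative Spanish tokens (hypothetical sample data).
words <- c("cuentas", "servicios", "ajustes")

# Guarded so the snippet does nothing if SnowballC is not installed.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  # Confirm that the Spanish stemmer is available in this installation.
  stopifnot("spanish" %in% SnowballC::getStemLanguages())

  # Stem each word with the Snowball Spanish stemmer.
  stems <- SnowballC::wordStem(words, language = "spanish")
  print(stems)
}
```

On a `tm` corpus, the equivalent step is the `tm_map(corpus, stemDocument, language = "spanish")` call quoted above.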
