I am new to the tm package in R. I am trying to run a word frequency analysis, but I know my source file contains several spelling errors. How can I fix them before computing the word frequencies?

I have already read another post (Stemming with R Text Analysis), but I have a question about the solution proposed there: is it possible to use a dictionary (a data frame, for example) to make several or all of the replacements in my corpus before creating the TermDocumentMatrix and running the word frequency analysis?

I have a data frame containing the dictionary, with the following structure:

sept   -> september  
sep    -> september  
acct   -> account  
serv   -> service  
servic -> service  
adj    -> adjustment  
ajuste -> adjustment  

I know I could write a function to transform my corpus, but I do not know how to automate this task, for example by looping over each record in my data frame.

Any help would be greatly appreciated.

OOP
    It depends totally on what your corpus is - what language, what acronyms, what domain-specific terms etc. But if you just want a stemmer constructed automatically from a standard English (or whatever languages) dictionary, then [Tyler Rinker's answers](http://stackoverflow.com/questions/24443388/stemming-with-r-text-analysis/24454727#24454727) show what you want. All you need to add is code for synthesizing likely misspellings, or use a word-distance metric like Levenshtein distance (see `adist`) to find the closest match in dictionary. – smci May 27 '15 at 23:10

1 Answer

For the basic automatic construction of a stemmer from a standard English dictionary, [Tyler Rinker's answers](http://stackoverflow.com/questions/24443388/stemming-with-r-text-analysis/24454727#24454727) already show what you want.

All you need to add is code for synthesizing likely misspellings, or for matching (common) misspellings in your corpus using a word-distance metric like Levenshtein distance (see `adist`) to find the closest match in the dictionary.
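
As a rough illustration of both ideas, here is a minimal sketch in base R. It is not taken from the linked answers: the names `fixes`, `replace_terms`, `closest_term` and the `max_dist` threshold are hypothetical, and the lookup table simply mirrors the structure shown in the question. It assumes the misspellings contain no regex metacharacters.

```r
# Lookup table mirroring the question's dictionary (hypothetical name: fixes).
fixes <- data.frame(
  from = c("sept", "sep", "acct", "serv", "servic", "adj", "ajuste"),
  to   = c("september", "september", "account", "service", "service",
           "adjustment", "adjustment"),
  stringsAsFactors = FALSE
)

# Replace every whole-word occurrence of fixes$from with fixes$to.
# The \\b anchors keep "sep" from matching inside "september".
# Assumes fixes$from contains plain letters only (no regex metacharacters).
replace_terms <- function(x, fixes) {
  for (i in seq_len(nrow(fixes))) {
    pattern <- paste0("\\b", fixes$from[i], "\\b")
    x <- gsub(pattern, fixes$to[i], x, perl = TRUE)
  }
  x
}

# For a word that is not in the table, return the nearest vocabulary entry
# by Levenshtein distance, but only if it is within max_dist edits
# (max_dist = 2 is an arbitrary choice for this sketch).
closest_term <- function(word, vocabulary, max_dist = 2) {
  d <- utils::adist(word, vocabulary)[1, ]
  if (min(d) <= max_dist) vocabulary[which.min(d)] else word
}

replace_terms("acct adj posted in sept", fixes)
# "account adjustment posted in september"

closest_term("acount", c("account", "service", "adjustment"))
# "account"
```

With tm, the same function could probably be applied to a whole corpus before building the TermDocumentMatrix via `tm_map(corpus, content_transformer(replace_terms), fixes = fixes)`, assuming `corpus` is an existing tm corpus. Lower-casing the corpus first would make the whole-word matching more reliable.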

smci
  • Actually, the corpus is in Spanish. Do you know where I can find guidance on how to construct the stemmer for Spanish? – OOP Jun 10 '15 at 23:18
  • Just use MrFlick's answer with `tm_map(corpus, stemDocument, language = "spanish")` – smci Jun 10 '15 at 23:27
  • `tm::stemDocument(... language='lang')` calls `SnowballC::wordStem()`, which has built-in stemmers for all the languages in `SnowballC::getStemLanguages()`, i.e. *"danish" "dutch" "english" "finnish" "french" "german" "hungarian" "italian" "norwegian" "porter" "portuguese" "romanian" "russian" "spanish" "swedish" "turkish"*. See the documentation for the `tm` and `SnowballC` packages. – smci Jun 11 '15 at 00:19
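
To check what the Spanish stemmer does before running it over a whole corpus, a small sketch along these lines may help. It assumes the SnowballC package is installed; the sample words are made up for illustration, and the language list returned may differ between package versions.

```r
library(SnowballC)

# Confirm that a Spanish stemmer is available in the installed version.
"spanish" %in% getStemLanguages()

# Stem a few sample Spanish words (illustrative input, not from the corpus).
words <- c("servicios", "ajustes", "cuentas")
stems <- wordStem(words, language = "spanish")

# Show each word next to its stem to judge whether the result is acceptable.
data.frame(word = words, stem = stems, stringsAsFactors = FALSE)
```

If the stems look reasonable, the `tm_map(corpus, stemDocument, language = "spanish")` call from the comment above applies the same stemmer to every document.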