I am text mining a large database to create indicator variables which indicate the occurrence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent.

However, there are some cases where the technicians misspelled a word, so my `grepl()` call doesn't catch that the phrase (albeit misspelled) occurred in an observation. Ideally, I would like to be able to submit each word in a phrase to a function that returns several common misspellings or typos of that word. Does such an R function exist?

With this, I could search for all possible combinations of these misspellings of the phrase in the comments field, and output the matches to another data frame. This way, I could look at each occurrence on a case-by-case basis to determine if the phenomenon I am interested in was actually described by the technician.
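
A minimal sketch of the kind of "inverse" spell checker described above, assuming that a typo is a single deletion, adjacent transposition, substitution, or insertion of a lowercase letter. The helper name `edits1` and the sample comments are hypothetical and only serve as an illustration; this is not a function from an existing R package.

```r
# Hypothetical helper: all strings within one edit of `word`
# (deletion, adjacent transposition, substitution, insertion).
edits1 <- function(word, alphabet = letters) {
  n <- nchar(word)
  pos <- 0:n
  # Split the word at every position into a left and a right part
  left  <- substring(word, 1, pos)
  right <- substring(word, pos + 1, n)
  tail1 <- substring(right, 2)          # right part without its first character
  has1  <- nchar(right) >= 1
  has2  <- nchar(right) >= 2

  del <- paste0(left[has1], tail1[has1])
  tra <- paste0(left[has2], substr(right[has2], 2, 2),
                substr(right[has2], 1, 1), substring(right[has2], 3))
  sub <- unlist(lapply(alphabet, function(ch) paste0(left[has1], ch, tail1[has1])))
  ins <- unlist(lapply(alphabet, function(ch) paste0(left, ch, right)))

  # Drop duplicates and the correctly spelled word itself
  setdiff(unique(c(del, tra, sub, ins)), word)
}

# Made-up comments, for illustration only
comments <- c("replaced the bearing", "replaced the baering", "checked the belt")

# Build one alternation pattern from the word and its variants
variants <- c("bearing", edits1("bearing"))
pattern  <- paste0("\\b(", paste(variants, collapse = "|"), ")\\b")

hits <- data.frame(comment = comments,
                   flagged = grepl(pattern, comments, perl = TRUE))
hits
```

Because the variants are built only from `letters`, they contain no regex metacharacters and can be pasted into a pattern directly. For a multi-word phrase, the same idea could be applied per word, at the cost of a much larger pattern.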

I have Googled around, but have only found references to actual spell checking packages for R. What I am looking for is an "inverse" spell checker. Since the number of phrases I am looking for is relatively small, I would realistically be able to check for misspellings by hand; I just figured it would be nice to have this ability built into an R package for future text mining efforts.

Thank you for your time!

SimplePi
Nick Evans
  • I think you're looking for approximate string matching algorithms like `agrep`. Type `?agrep` in R. – Arun Feb 01 '13 at 21:07
  • I don't think it will help in this particular case but the **utils** package that comes with R has some spell checking ability in the form of `aspell()`. A [paper](http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Hornik+Murdoch.pdf) on this was in the R Journal a few issues back. – Gavin Simpson Feb 01 '13 at 21:20
  • I had a look at the agrep as well as the formulation of the Levenshtein distance, and it appears like either this or the Damerau-Levenshtein distance measure based search would suit my needs. I'll test it for a little while and see how it goes. – Nick Evans Feb 01 '13 at 21:24
  • The issue I have found is that the matching is not predictable. I could make this method work for me though. Right now, the 'agrep' function seems to be a bit of a black box in that I am not 100% sure of what it is matching and what it is not. I'm going to try to write a function that returns all possible alphanumeric strings given a distance so that I can be more sure. – Nick Evans Feb 01 '13 at 21:41
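
One way to make the matching discussed in these comments less of a black box is to compute the edit distance explicitly with `adist()` from **utils** and choose the cutoff yourself. The sketch below is an illustration under assumptions: the helper name `min_word_distance`, the sample comments, and the cutoff of 2 are all hypothetical. Note that `adist()` computes a generalized Levenshtein distance, so as far as I know a swap of two adjacent letters counts as two edits rather than one, unlike the Damerau-Levenshtein distance.

```r
# Hypothetical helper: for each comment, the smallest edit distance
# between `target` and any single word of that comment.
min_word_distance <- function(comments, target) {
  # Lowercase and split on anything that is not a letter or digit
  words <- strsplit(tolower(comments), "[^[:alnum:]]+")
  vapply(words, function(w) {
    w <- w[nzchar(w)]
    if (length(w) == 0) return(NA_integer_)
    as.integer(min(adist(w, target)))
  }, integer(1))
}

# Made-up comments, for illustration only
comments <- c("Replaced the bearing.", "replaced the baering", "checked the belt")

d <- min_word_distance(comments, "bearing")

# Keep the distance next to each comment so every match can be reviewed by hand
data.frame(comment = comments, distance = d, candidate = d <= 2)
```

Keeping the distance as a column shows exactly why a comment was flagged, and the cutoff can be tightened or loosened per phrase.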

1 Answer

As Gavin Simpson suggested, you can use `aspell()`. For this to work, a spell checker program such as aspell needs to be installed on your system. Many Linux distributions include it by default; I don't know about other systems, and I am not sure whether one is installed with R.

See the following function for an example of use. It depends on your input data and what exactly you want to do with the result (e.g. correct misspelling with the first suggestion) which you didn't specify:

check_spelling <- function(text) {
  # Strip commas and periods, then split the text into words
  text <- gsub("[,.]", "", text)
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by repeated spaces
  # Write one word per line to a temporary file, removed when the function exits
  filename <- tempfile()
  on.exit(unlink(filename))
  writeLines(words, con = filename)
  # Check spelling of the file using aspell (requires an external spell checker)
  result <- aspell(filename)
  # Extract the list of suggestions, named by the misspelled word
  suggestions <- result$Suggestions
  names(suggestions) <- result$Original
  suggestions
}

> text <- "I am text mining a large database to create indicator variables which indicate the occurence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent. "
> check_spelling(text)
$occurence
[1] "occurrence"   "occurrences"  "occurrence's"
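
As one possible follow-up, the suggestion list can be used to correct the text with the first suggestion, as mentioned above. This is a minimal sketch, not part of the original function: the helper name `apply_first_suggestion` is hypothetical, and the `suggestions` list is typed in by hand to mirror the output shown above so that the example runs without a spell checker installed.

```r
# Hypothetical helper: replace each misspelled word by its first suggestion.
# `suggestions` is a named list shaped like the output of check_spelling().
apply_first_suggestion <- function(text, suggestions) {
  for (wrong in names(suggestions)) {
    candidates <- suggestions[[wrong]]
    if (length(candidates) == 0) next   # no suggestion available, leave as-is
    # fixed = TRUE treats the misspelling as a literal string, not a regex.
    # Note that this also replaces the misspelling inside longer words.
    text <- gsub(wrong, candidates[1], text, fixed = TRUE)
  }
  text
}

# Hand-built list mirroring the output above
suggestions <- list(occurence = c("occurrence", "occurrences", "occurrence's"))

corrected <- apply_first_suggestion("indicate the occurence of certain phrases",
                                    suggestions)
corrected
```

Blindly taking the first suggestion can be wrong for technical jargon, so for a small set of phrases it may be safer to review the suggestions by hand before applying them.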
leo
Jan van der Laan
  • When I try your example on my Mac, I get an error message saying that there is no suitable spell-checker program on my computer. How can I choose or find a spell-checker program? – PAC Jun 19 '13 at 13:56
  • @PAC You need a spell checker that supports the ispell interface, such as aspell or hunspell. Googling 'aspell' and 'mac' gives multiple hits on how to install one (e.g. http://dbader.org/blog/spell-checking-latex-documents-with-aspell). Also see http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Hornik+Murdoch.pdf – Jan van der Laan Jun 20 '13 at 08:54