This is my first post on Stack Overflow, so I apologize if I violate any rules.

I am using the R package qdap to spellcheck very messy medical record text. The goal is to identify misspellings of drug side effects in order to build a side-effect misspelling dictionary. The text contains many misspellings, abbreviations, and other irregularities that make a simple spellcheck difficult. When I run a spellcheck on even a short doctor's note, hundreds of words are flagged, which makes it hard to find the side-effect misspellings I care about.

I attempted to use the following code with a dictionary consisting only of correctly spelled side effects, so that qdap would flag only words that are close misspellings of entries in this dictionary. The problem is that nearly every word in the text, properly or improperly spelled, is now returned as incorrect (e.g. "notable" is flagged as misspelled and "nausea" is the suggested replacement from my dictionary).

library(qdap)

# One correctly spelled side effect per line
dictionary <- readLines("dictionary.txt")
check_spelling(text$NOTE_TEXT[3379], range = 0, dictionary = dictionary,
    assume.first.correct = FALSE)

Here "dictionary" is my self-built side-effects dictionary, and check_spelling is being run on text read from a CSV file. Is there any way to keep words that are very far from every word in my dictionary (such as "notable" in the example above) out of the results? That way I could cut down the number of words in my check_spelling output and identify only the misspelled side effects.

As a small note, changing assume.first.correct to TRUE will not change anything, because the dictionary does not run with it set that way.

Scott
  • All I know is that qdap's author is terrible. He also might be a raptor. Stay vigilant - you don't want to be eaten. – Dason Apr 18 '17 at 14:52
  • @Dason This is true. – Tyler Rinker Apr 18 '17 at 14:58
  • @Scott I am the author of the qdap package (Dason is a contributor to the package). This implementation of check_spelling uses a distance rule, so the answer is no. I would recommend the excellent hunspell package to check spelling and suggest replacements. That package didn't exist at the time `check_spelling` was written but is available now, and it is much more robust. https://cran.r-project.org/web/packages/hunspell/index.html – Tyler Rinker Apr 18 '17 at 15:05
  • Great Tyler (and Dason), I appreciate the help! – Scott Apr 18 '17 at 15:24
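
The filtering asked about in the question could be approximated outside of qdap. The following is a minimal sketch in base R, not something taken from the thread: it assumes the candidate words have already been extracted (for example, the words flagged by a spellchecker such as hunspell), and it keeps only those within a small edit distance of a dictionary term. The helper name near_dictionary, the max_dist cutoff of 2, and the sample words are all hypothetical choices for illustration.

```r
# Hypothetical helper: keep only words that are close to, but not identical
# to, a term in a small reference dictionary.
near_dictionary <- function(words, dictionary, max_dist = 2) {
  words <- unique(tolower(words))
  dictionary <- unique(tolower(dictionary))
  if (length(words) == 0 || length(dictionary) == 0) {
    return(data.frame(word = character(0), suggestion = character(0),
                      distance = numeric(0), stringsAsFactors = FALSE))
  }
  # adist() gives a length(words) x length(dictionary) matrix of
  # generalized Levenshtein (edit) distances.
  d <- utils::adist(words, dictionary)
  nearest <- apply(d, 1, which.min)               # closest dictionary term per word
  best <- d[cbind(seq_along(words), nearest)]     # distance to that term
  out <- data.frame(word = words,
                    suggestion = dictionary[nearest],
                    distance = best,
                    stringsAsFactors = FALSE)
  # Drop exact matches (distance 0) and words too far from every term.
  out[out$distance > 0 & out$distance <= max_dist, , drop = FALSE]
}

# Made-up sample data, standing in for dictionary.txt and one note's tokens.
side_effects <- c("nausea", "headache", "dizziness")
tokens <- c("notable", "nausia", "headace", "patient", "dizzyness", "nausea")

result <- near_dictionary(tokens, side_effects)
print(result)
```

With a cutoff like this, a word such as "notable" is dropped because it is several edits away from every side effect, while "nausia" is kept with "nausea" as its suggestion. The right cutoff is a judgment call: short side-effect names may need a stricter limit, and a relative distance (edits divided by word length) may work better on real notes.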