This is my first post on Stack Overflow, so I apologize if I violate any rules.
I am using the R package qdap to spellcheck very messy medical record text. The goal is to identify misspellings of drug side effects in order to build a side-effect misspelling dictionary. The text contains many misspellings, abbreviations, and other irregularities that make a simple spellcheck difficult. When I run a spellcheck on even a short doctor's note, hundreds of words are returned, which makes it hard to find the side-effect misspellings I care about.
I attempted to use the following code with a dictionary consisting only of correctly spelled side effects, so that qdap would flag words that are close misspellings of entries in this dictionary. The problem is that with this dictionary, nearly every word in the text, whether properly or improperly spelled, is now returned as incorrect (e.g. "notable" is flagged as misspelled, and "nausea" is the suggested replacement from my dictionary).
library(qdap)

# Custom dictionary: one correctly spelled side effect per line
dictionary <- readLines("dictionary.txt")
check_spelling(text$NOTE_TEXT[3379], range = 0, dictionary = dictionary,
               assume.first.correct = FALSE)
Here "dictionary" is my self-built side-effects dictionary, and check_spelling is run on text read from a CSV file. Is there any way to keep words that are very far from every entry in my dictionary (such as "notable" in the example above) from appearing in the check_spelling output? That would cut down the number of words returned and let me identify only the misspelled side effects.
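To make the kind of filtering I have in mind concrete, here is a minimal sketch that does not use qdap at all. It relies on base R's adist (edit distance) and keeps only the flagged words that lie within a small distance of some dictionary entry. The word lists and the max_dist threshold are made up for illustration and would need tuning on real notes.

```r
# Hypothetical inputs: a few dictionary terms and some words a spellchecker flagged
side_effects <- c("nausea", "headache", "dizziness")
flagged <- c("nausia", "notable", "hedache", "pt", "dizzyness")

# Edit distance from every flagged word (rows) to every dictionary term (columns)
d <- utils::adist(flagged, side_effects, ignore.case = TRUE)

# Index of the closest dictionary term, and its distance, for each flagged word
nearest <- apply(d, 1, which.min)
min_dist <- apply(d, 1, min)

# Assumed threshold: keep words at most 2 edits away from some dictionary term
max_dist <- 2
keep <- min_dist <= max_dist

candidates <- data.frame(
  word = flagged[keep],
  suggestion = side_effects[nearest[keep]],
  distance = min_dist[keep],
  stringsAsFactors = FALSE
)
print(candidates)
```

With these made-up inputs, "notable" and "pt" drop out because they are more than two edits from every dictionary term, while the three near-misses are kept. A fixed threshold is crude; scaling it by word length might work better for short terms.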
As a small note, changing assume.first.correct to TRUE does not help, because the spellcheck does not run with my dictionary when it is set that way.