
I have been searching for a long time for a way to correct typos in text in R without manually adding or replacing words. My data is free text: patients' complaints recorded in an emergency department. After running a simple Random Forest to select the 100 most important features, this is the result I get:

> predictors(results1)

  [1] "back"       "refil"      "pain"       "med"        "cough"      "sob"        "day"        "chronic"    "deni"      
 [10] "right"      "brought"    "hit"        "request"    "injuri"     "hemorrhoid" "hour"       "clot"       "depress"   
 [19] "nausea"     "congest"    "clinic"     "headach"    "chest"      "sore"       "month"      "elev"       "dizzi"     
 [28] "toothach"   "week"       "throat"     "head"       "also"       "small"      "vomit"      "famili"     "seen"      
 [37] "burn"       "last"       "report"     "hematuria"  "per"        "walter"     "abdomin"    "ear"        "side"      
 [46] "low"        "nasal"      "intermitt"  "night"      "drh"        "dri"        "eye"        "obtain"     "patient"   
 [55] "pressur"    "product"    "take"       "vet"        "fever"      "blood"      "ago"        "due"        "extrem"    
 [64] "feel"       "note"       "triag"      "weak"       "aaa"        "aand"       "aarm"       "aava"       "abcess"    
 [73] "abcsess"    "abd"        "abdimin"    "abdnorm"    "abdomen"    "abdomi"     "abdominal"  "abdominla"  "abdonin"   
 [82] "abdpain"    "abil"       "abilifi"    "abl"        "ablat"      "abliat"     "abnd"       "abnorm"     "abouthi"   
 [91] "abraid"     "abraison"   "abras"      "abscess"    "absent"     "abul"       "abus"       "abuterol"   "abx"       
[100] "abxno"

The rows that start with [73] and [82] show how misspellings are going to affect my results: "abdimin", "abdomi", "abdominla" and "abdonin" all appear as separate features next to "abdominal". I have read about and tried Hunspell, Aspell, Soundex, vwr and the RecordLinkage package. The problem with Aspell is that I can't make it work on my laptop: on Windows it requires old software to be installed, and that software is very tricky to work with. With the other packages, my problem is that I don't want to go through 6k words one by one and add them to a list, or compare them in pairs with each other or with a correct form. That would take ages. Do you have any suggestions for how I can write code in R that automatically replaces each word in my data set with the word closest to it in spelling? Or is there a way to make the packages named above do the same job?

Thank you.

Diana01
  • Peter Norvig made a big splash with his compact proposal for a spell checker. Rasmus Bååth translated it into two lines of R: http://www.sumsar.net/blog/2014/12/peter-norvigs-spell-checker-in-two-lines-of-r/ – dmi3kno Jul 29 '17 at 07:54
  • I tried the spell checker but it's tricky to work with. The software is also very old. – Diana01 Sep 25 '17 at 08:01
  • Maybe the answer to [this](https://stackoverflow.com/questions/56026550/how-to-use-hunspell-package-to-suggest-correct-words-in-a-column-in-r) post can help. – Yeshyyy Jun 05 '20 at 01:02
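
One possible way to avoid both an external dictionary and manual pair-by-pair review is to let the corpus correct itself: treat tokens that occur often as trusted spellings and map each rare token to its nearest trusted token by edit distance. The sketch below uses only `adist()` from base R (package utils). It is not the Norvig method from the comment above, only a simpler frequency-guided variant. The function name `correct_rare_tokens` and the thresholds `min_count` and `max_dist` are hypothetical and would need tuning on the real data.

```r
# Hypothetical helper: map rare tokens to the closest frequent token.
# tokens    : character vector of all tokens in the corpus, repeats included
# min_count : tokens seen at least this often are treated as trusted spellings
# max_dist  : largest Levenshtein distance accepted as a correction
# Both thresholds are illustrative assumptions, not tuned values.
correct_rare_tokens <- function(tokens, min_count = 5, max_dist = 2) {
  counts  <- table(tokens)
  vocab   <- names(counts)
  trusted <- vocab[as.vector(counts) >= min_count]
  rare    <- setdiff(vocab, trusted)

  # Start with the identity mapping: every token maps to itself.
  mapping <- vocab
  names(mapping) <- vocab
  if (length(trusted) == 0 || length(rare) == 0) return(mapping)

  # Distance matrix: one row per rare token, one column per trusted token.
  d <- utils::adist(rare, trusted)
  for (i in seq_along(rare)) {
    j <- which.min(d[i, ])
    if (d[i, j] <= max_dist) mapping[[rare[i]]] <- trusted[j]
  }
  mapping
}

# Toy data modelled on the stems in the question.
tokens <- c(rep("abdomin", 10), rep("abscess", 6),
            "abdimin", "abdonin", "abcess", "abx")

fix     <- correct_rare_tokens(tokens)
cleaned <- unname(fix[tokens])  # corrected token stream, same length as input
print(table(cleaned))
```

Tokens with no trusted neighbour within `max_dist`, such as "abx" here, are left unchanged. Short tokens are the main risk: with a distance of 2, distinct three- or four-letter words can be merged, so it may be worth lowering `max_dist` for short tokens or inspecting the mapping before applying it.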
