I am using R for text mining, and my data were concatenated from several text columns. In some cases a word has been split by a space, like "functi oning". I want to detect all such cases and remove the space by checking against a dictionary. I know of the splitWords function in aspell; I want a function that does exactly the opposite.

lawyeR

1 Answer

Here is an approach, based on some code I found, but you need to provide some example text and even just pseudo code to help others respond.

First, create an object holding a large set of correctly spelled words. Then compare your vector of words to that set with adist, keeping only matches within a small edit distance; ideally the only difference is the internal space you want to remove, which counts as a single edit. I doubt that this will solve everything, but it may help.
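As a quick check of the edit-distance idea, here is a minimal sketch using the question's own example. It assumes the target word "functioning" is in the dictionary; the variable name d is only for illustration.

```r
# Removing the stray space is a single deletion,
# so the generalized Levenshtein distance is 1.
d <- utils::adist("functi oning", "functioning")
d[1, 1]
```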

sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))

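correct <- function(word) {
  d <- adist(word, sorted_words)  # edit distance from word to every dictionary word
  # first (most frequent) dictionary word at the smallest distance, if that is at most 2; otherwise word itself
  c(sorted_words[d <= min(d, 2)], word)[1]
}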

Then use the correct function.
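If the goal is only to rejoin split words rather than to correct spelling in general, a plain dictionary lookup on adjacent tokens may be enough. The sketch below is one possible way to do that, not part of the code above: rejoin_split_words and the toy dictionary are hypothetical names made up for illustration, and in practice the dictionary would be a large word list. It joins two neighbouring tokens only when at least one of them is not a known word and their concatenation is. It does not handle punctuation or words split into more than two pieces.

```r
# Hypothetical toy dictionary; replace with a real word list in practice.
dictionary <- c("the", "system", "is", "functioning", "well", "function")

# Join adjacent tokens when at least one is not a dictionary word
# but their concatenation is.
rejoin_split_words <- function(text, dictionary) {
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[nzchar(tokens)]          # drop empty strings from leading spaces
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens)) {
      joined <- paste0(tokens[i], tokens[i + 1])
      both_known <- all(tolower(tokens[i:(i + 1)]) %in% dictionary)
      if (!both_known && tolower(joined) %in% dictionary) {
        out <- c(out, joined)               # keep the rejoined word
        i <- i + 2                          # skip the second fragment
        next
      }
    }
    out <- c(out, tokens[i])                # keep the token unchanged
    i <- i + 1
  }
  paste(out, collapse = " ")
}

fixed <- rejoin_split_words("the system is functi oning well", dictionary)
fixed
```

Requiring that at least one fragment is unknown keeps two valid neighbouring words from being merged just because their concatenation also happens to be a word.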

Christopher Bottoms
lawyeR