I am using R for text mining, and my data have been concatenated from several text columns. In some cases a word has been split by a space, like "functi oning". I want to detect all such cases and remove the space by checking the result against a dictionary. I know the splitWords function in aspell; I want a function that does exactly the opposite.
1 Answer
Here is an approach based on some code I found, though you should provide some example text, or even just pseudocode, to help others respond.
First, create an object that holds a large set of correctly spelled words. Then compare your vector of words to that set with adist, keeping the allowed edit distance small; ideally the only difference is the internal space you would like to remove. I doubt that this will solve everything, but it may help.
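# Build a vocabulary from a large text, sorted by frequency (most common words first).
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
# Return the most frequent vocabulary word at the smallest edit distance from `word` (at most 2), or `word` itself if none is that close.
correct <- function(word) { c(sorted_words[adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }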
Then call the correct function on each word (or on each candidate pair such as "functi oning").
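For the specific case of a stray space, a direct dictionary lookup may be simpler than a general edit-distance search. The sketch below is only an illustration, not part of the code above: join_split_words is a hypothetical helper, and the tiny dictionary is a stand-in for a real word list such as sorted_words. It merges two adjacent tokens when their concatenation is a dictionary word and at least one of the two fragments is not a word on its own. It assumes tokens are separated by spaces and carry no punctuation.

```r
# Hypothetical helper (an assumption, not from the answer above): merge
# adjacent tokens when the joined form is a dictionary word and at least
# one of the two fragments is not itself in the dictionary.
join_split_words <- function(text, dictionary) {
  tokens <- strsplit(text, " +")[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens)) {
      left <- tolower(tokens[i])
      right <- tolower(tokens[i + 1])
      joined <- paste0(left, right)
      has_fragment <- !(left %in% dictionary) || !(right %in% dictionary)
      if (has_fragment && joined %in% dictionary) {
        # Keep the original casing of the two pieces when merging.
        out <- c(out, paste0(tokens[i], tokens[i + 1]))
        i <- i + 2
        next
      }
    }
    out <- c(out, tokens[i])
    i <- i + 1
  }
  paste(out, collapse = " ")
}

# Toy dictionary for illustration only; use a large word list in practice.
dictionary <- c("the", "system", "is", "functioning", "well", "in", "to", "into")

# Removing the stray space is a single edit, which is why a small adist
# cutoff can catch this kind of error.
adist("functi oning", "functioning")[1, 1]  # 1

join_split_words("the system is functi oning well", dictionary)
# "the system is functioning well"
```

Requiring that at least one fragment be a non-word keeps legitimate pairs such as "in to" apart even though "into" is in the dictionary. The trade-off is that a split producing two valid words (for example "now here" from "nowhere") is left untouched.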

Christopher Bottoms

lawyeR