
This is the text in my dataframe `df`, which has a text column called 'problem_note_text':

SSCIssue: Note Dispenser Failureperformed checks / dispensor failure / asked the stores to take the note dispensor out and set it back / still error message says front door is open / hence CE attn reqContact details - Olivia taber 01159063390 / 7am-11pm

library(stringr)
library(qdap)

df$problem_note_text <- tolower(df$problem_note_text)
df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
df$problem_note_text <- str_replace_all(df$problem_note_text, "[[:punct:]]", " ") # replace punctuation with a space
df$problem_note_text <- str_replace_all(df$problem_note_text, " +", " ") # collapse repeated spaces into one
df$problem_note_text <- tm::removeWords(df$problem_note_text, tm::stopwords(kind = "english"))
Words <- all_words(df$problem_note_text, begins.with = NULL)

Now I have a dataframe with a list of words, but it contains words like

"Failureperformed"

which need to be split into two meaningful words like

"Failure" "performed".

How do I do this? Also, the words dataframe contains words like

"im" , "h"

which do not make sense and have to be removed; I do not know how to achieve this.
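The closest I have got for the nonsense words is filtering against an English word list, though I am not sure this is the right approach (the word list below is a toy stand-in for a real dictionary):

```r
# toy word list standing in for a real English dictionary
wl <- c("note", "dispenser", "failure", "performed", "error", "front", "door", "open")

words <- c("note", "im", "h", "failure", "performed")
keep  <- nchar(words) > 1 & words %in% wl  # drop one-letter and unknown tokens
words[keep]
# [1] "note"      "failure"   "performed"
```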

Shweta Kamble
    If there is no pattern, it is not doable – akrun Jun 22 '15 at 15:09
    How would you treat something like `nowhere` - as `no` and `where` or `now` and `here`? – nrussell Jun 22 '15 at 15:10
  • I was thinking of maybe there is some dictionary available which could be used to parse the sentence. I used the qdap package all_words function to get words out of the sentences I had, but few words did not seem to be parsed well and I got joint words without meaning. – Shweta Kamble Jun 22 '15 at 15:22
  • Can you share a piece of the data? If sensor advised is in your document as two seperate words you may be able to change your pre-processing to avoid losing the space. – Steve Bronder Jun 22 '15 at 15:39
  • I'm guessing it may be because you have hyphen's separating characters (i.e., `sensor-advised`) in the original data. If you could share some of the data that's causing the issue (a simple search should reveal the initial words that cause the problem) we could direct you better. The following **qdap** vignette can help with debugging and cleaning text to isolate issues: http://cran.r-project.org/web/packages/qdap/vignettes/cleaning_and_debugging.pdf – Tyler Rinker Jun 22 '15 at 21:31
  • I agree with Steve_Corrin, better tokenization might solve this problem without the ambiguity of post-concatenation splitting through lookup. Try installing the dev branch of `quanteda`: `devtools::install_github("kbenoit/quantedaData")` and then if you use say `tokenize(df$problem_note_text, removePunct = TRUE)` then you should parse "sensor-advised" or those two words separated by any non-white-space/non-word character except `_`. – Ken Benoit Jun 22 '15 at 22:15
  • The problem is not the tokenization as far as I know; the problem is unclean text. Let me share with you an example. – Shweta Kamble Jun 24 '15 at 17:26
  • I have edited the question again and also added in stuff that I have tried to achieve the goal, which is coming up with a column of sensible words and their frequency in the dataframe. @Steve_Corrin. Please take a look at it and let me know if I have done something wrong. – Shweta Kamble Jun 24 '15 at 19:45

1 Answer


Given a list of English words you can do this pretty simply by looking up every possible split of the word in the list. I'll use the first Google hit I found for my word list, which contains about 70k lower-case words:

wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1

check.word <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  # try every split point: row 1 holds the prefix, row 2 the suffix
  parts <- sapply(1:(nc - 1), function(y) c(substr(x, 1, y), substr(x, y + 1, nc)))
  # keep only the splits where both halves appear in the word list
  parts[, parts[1, ] %in% wl & parts[2, ] %in% wl]
}

This sometimes works:

check.word("screenunable", wl)
# [1] "screen" "unable"
check.word("nowhere", wl)
#      [,1]    [,2]  
# [1,] "no"    "now" 
# [2,] "where" "here"

But it sometimes fails when the relevant words aren't in the word list (in this case "sensor" was missing):

check.word("sensoradvise", wl)
#     
# [1,]
# [2,]
"sensor" %in% wl
# [1] FALSE
"advise" %in% wl
# [1] TRUE
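To clean your whole Words table, you could combine this lookup with a filter in something like the following sketch (assuming `Words$WORD` holds the tokens from qdap's `all_words`, and only accepting a split when it is unambiguous):

```r
split.or.keep <- function(x, wl) {
  if (x %in% wl) return(x)        # already a valid word
  parts <- check.word(x, wl)
  if (!is.matrix(parts) && length(parts) == 2)
    return(parts)                 # exactly one valid split
  x                               # no split, or ambiguous: keep as-is
}

cleaned <- unlist(lapply(as.character(Words$WORD), split.or.keep, wl = wl))
cleaned <- cleaned[nchar(cleaned) > 1 & cleaned %in% wl]  # drop "im", "h", etc.
```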
josliber