Are there text-processing functions that operate at the word level in R?

Question

I am trying to find a group of functions in R that operate at the word level, e.g. a function that returns the position of a word in a sentence. For example, given the following sentence and query:

sentence <- "A sample sentence for demo"
query <- "for"

the function would return 4, because for is the 4th word.
It would also be great to have a utility function that extends the query to the left or to the right, e.g. extend(query, 'right') would return for demo and extend(query, 'left') would return sentence for.

I have already gone through functions like grep, gregexpr, word from the stringr package, and others. All of them seem to operate at the character level.

Check out `stringr::word`. As in: `word(string, start = 1L, end = start, sep = fixed(" "))`. Negative indices count from the end, so `start = -2L, end = -1L` gets the final two words. - p0bs, Apr 02 '17 at 16:23

score 1 · Answer 1 · answered Apr 02 '17 at 17:52

If you use scan, it will split input at whitespace:

> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4

The what="" argument is needed to tell scan to expect character rather than numeric input. If your input is ever full English sentences, you might need to strip punctuation first using gsub with pattern="[[:punct:]]". You may also need to look at the tm (text mining) package if you are trying to classify parts of speech or handle large documents.
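The punctuation caveat above can be sketched as follows. This is a minimal illustration, not part of the original answer: the punctuated sentence is made up for the example, and quiet=TRUE is only there to suppress the "Read N items" message.

```r
# Illustrative input with punctuation (an assumption, not from the question)
sentence <- "A sample sentence, for demo."
query <- "for"

# Strip punctuation first, then let scan split on whitespace
clean <- gsub(pattern = "[[:punct:]]", replacement = "", x = sentence)
s.scan <- scan(text = clean, what = "", quiet = TRUE)

# Position of the query among the cleaned words
position <- which(s.scan == query)
position
```

Without the gsub step, the third word would be "sentence," (with the comma attached), so a query for sentence would not match.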

score 0 · Answer 2 · answered Apr 02 '17 at 16:32

As I mentioned in my comment, stringr is useful in these instances.

library(stringr)

sentence <- "A sample sentence for demo"
wordNumber <- 4L

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                       start = wordNumber - 1L,
                       end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)

And this yields:

> fourthWord
[1] "for"
> previousWords
[1] "sentence for"
> laterWords
[1] "for demo"
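The snippet above hard-codes the word number. A possible way to derive it from the query, assuming words are separated by single spaces and only the first match matters, is to look the query up with base R's strsplit and match and pass the result to word:

```r
library(stringr)

sentence <- "A sample sentence for demo"
query <- "for"

# Split on single spaces and find the first word equal to the query
words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
wordNumber <- match(query, words)   # NA if the query is absent

# Extend one word to the right and one word to the left
rightExtended <- word(sentence, start = wordNumber, end = wordNumber + 1L)
leftExtended <- word(sentence, start = wordNumber - 1L, end = wordNumber)

# Negative indices count from the end of the sentence
lastTwoWords <- word(sentence, start = -2L, end = -1L)
```

This sketch does no boundary handling: if the query is the first or last word, the extended range runs past the sentence and needs separate treatment.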

I hope that helps you.

score 0 · Accepted Answer · answered Apr 06 '17 at 19:46

I have written my own functions. indexOf returns the index of the word if it is found in the sentence and -1 otherwise, much like Java's indexOf().

indexOf <- function(sentence, word){
  # split on single spaces and flatten the result to a character vector
  sentenceAsVector <- unlist(strsplit(sentence, split = " "))

  if(!(word %in% sentenceAsVector)){
    # word not found
    result <- -1
  } else {
    # all positions at which the word occurs
    result <- which(sentenceAsVector == word)
  }
  return(result)
}

The extend function works, but it is quite lengthy and doesn't look like R code at all. If the query is a word on the boundary of the sentence, i.e. the first or the last word, the first two or the last two words are returned, regardless of direction.

extend <- function(sentence, query, direction){
  sentenceAsVector <- unlist(strsplit(sentence, split = " "))
  lengthOfSentence <- length(sentenceAsVector)
  # use the first match only; indexOf returns -1 when the query is absent
  location <- indexOf(sentence, query)[1]
  if(location == -1){
    return(NA_character_)
  }
  if(location == 1){
    # first word: always return the first two words
    return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
  }
  if(location == lengthOfSentence){
    # last word: always return the last two words
    return(paste(sentenceAsVector[lengthOfSentence - 1],
                 sentenceAsVector[lengthOfSentence], sep = " "))
  }
  if(direction == "right"){
    # query followed by the next word
    return(paste(sentenceAsVector[location],
                 sentenceAsVector[location + 1], sep = " "))
  }
  if(direction == "left"){
    # previous word followed by the query
    return(paste(sentenceAsVector[location - 1],
                 sentenceAsVector[location], sep = " "))
  }
}
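Since the answer notes that extend does not feel much like R, here is one possible more compact sketch using index ranges. The name extend_query and the n argument are hypothetical, introduced only for illustration. It also behaves differently at the boundaries: instead of returning the first or last two words, it clips the range to the sentence, and it returns one result per match.

```r
# Hypothetical helper: extend every match of `query` by `n` words
# in the given direction, clipping at the sentence boundaries.
extend_query <- function(sentence, query,
                         direction = c("right", "left"), n = 1L) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
  hits <- which(words == query)          # integer(0) if there is no match

  vapply(hits, function(i) {
    idx <- if (direction == "right") i:(i + n) else (i - n):i
    idx <- idx[idx >= 1L & idx <= length(words)]   # drop out-of-range indices
    paste(words[idx], collapse = " ")
  }, character(1))
}

sentence <- "A sample sentence for demo"
extend_query(sentence, "for", "right")
extend_query(sentence, "for", "left")
```

When the query is absent, the function returns an empty character vector rather than -1 or NA.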

score 0 · Answer 4 · answered Oct 04 '17 at 22:04

The answer depends on what you mean by a "word". If you mean a whitespace-separated token, then @imran-ali's answer works fine. If you mean a word as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.

The following handles punctuation correctly:

library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"

# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
##   text            before  instance  after
## 1 1    A sample sentence    for     demo

# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## [1] 4

This also works if there are multiple matches:

sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1]  1  4  7 10

We can verify that this is correct:

text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"
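For comparison, a rough base R approximation of the same count is sketched below. It is not the Unicode word segmentation that corpus performs: it simply assumes that every run of alphanumeric characters and every single punctuation character counts as one token, which happens to give the same positions for this sentence.

```r
sentence2 <- "for one, for two! for three? for four"
query <- "for"

# Tokens: runs of letters/digits, or single punctuation characters
tokens <- regmatches(sentence2,
                     gregexpr("[[:alnum:]]+|[[:punct:]]", sentence2))[[1]]

# Positions of the query among those tokens
positions <- which(tokens == query)
positions
```

Inputs with apostrophes, hyphens, or non-ASCII text will generally be tokenized differently by this pattern than by corpus.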
