
I have a dataset with a unique ID and a sentence for each ID. I would like to split each sentence into words and remove the stopwords to clean the data for further analysis.

Example of the dataset: 
ID  Sentence
1  The quick brown fox 
2  Feel free to be

After breaking up the sentences: 
ID  Word 
1  The 
1  quick 
1  brown 
1  fox 
2  Feel 
2  free 
2  to 
2  be 

After removing the stopwords: 
ID  Word
1  quick 
1  brown 
1  fox 
2  Feel 
2  free

I already have the IDs and sentences in a dataframe. What would be a suitable function to break up the texts, removing any punctuation attached to each word, and then dropping the rows that contain stopwords?
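
For reference, a minimal version of the dataframe (structure assumed from the example above):

df <- data.frame(ID = c(1, 2),
                 Sentence = c("The quick brown fox", "Feel free to be"),
                 stringsAsFactors = FALSE)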

xyn
  • Have a look at [this question](https://stackoverflow.com/questions/47613678/converting-data-frame-to-tibble-with-word-count/47614496#47614496). I think this can help you. You need to learn how to use `unnest_tokens()`. – jazzurro Jan 25 '18 at 03:01
  • I have tried the function but faced many difficulties. Here are some of the errors I am facing: Error: Can't convert NULL to a quosure; Error in typeof(x) : object 'word' not found; Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1. – xyn Jan 25 '18 at 05:51
  • I just left a demo for you. I do not have your actual data. So I cannot really give you more than what I wrote. Change whichever parts in the code and see what you can do. – jazzurro Jan 25 '18 at 06:01

2 Answers


Using the tidytext package, you can do the following. The package comes with a stop-word list, stop_words, which you load with data(stop_words). You then apply unnest_tokens() to the sentence column, specifying two names: the input column holding the text and the output column that will hold the individual words. Once the sentences are teased apart, you subset the data to drop the stopwords; here I used filter() from the dplyr package.

library(dplyr)
library(tidytext)

foo <- data.frame(ID = c(1, 2),
                  Sentence = c("The quick brown fox", "Feel free to be"),
                  stringsAsFactors = FALSE)

# load the stop-word list that ships with tidytext
data(stop_words)

# one word per row, then drop the rows whose word is a stopword
unnest_tokens(foo, input = Sentence, output = word) %>%
    filter(!word %in% stop_words$word)

  ID  word
1  1 quick
2  1 brown
3  1   fox
4  2  feel
5  2  free
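
Note that unnest_tokens() lowercases the tokens and strips punctuation by default, which is why Feel shows up as feel above (you can pass to_lower = FALSE to keep the original case). If you prefer a join over filter(), here is a small sketch of an equivalent variant:

unnest_tokens(foo, input = Sentence, output = word) %>%
    anti_join(stop_words, by = "word")

This should give the same rows as the filter() version here.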
jazzurro
  A=read.table(text="ID  Sentence
    1  'The quick brown fox' 
    2  'Feel free to be'",h=T,stringsAsFactors=F)


(dat=rev(stack(setNames(strsplit(A$Sentence," "),1:2))))

  ind values
1   1    The
2   1  quick
3   1  brown
4   1    fox
5   2   Feel
6   2   free
7   2     to
8   2     be


dat[-grep("The|to|be",dat$values),]
  ind values
2   1  quick
3   1  brown
4   1    fox
5   2   Feel
6   2   free

or, using the stop_words list from the tidytext package (as in the first answer):

 # tolower() so that capitalised words such as 'The' match the lowercase stop_words list
 dat[!tolower(dat$values) %in% stop_words$word, ]
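
The question also asks about stripping punctuation attached to the words. Neither snippet above handles that, so here is a small base R sketch (assuming the same dat as above) that removes punctuation with gsub() before the stopwords are filtered out:

 # remove punctuation attached to words, e.g. trailing commas or full stops
 dat$values <- gsub("[[:punct:]]", "", dat$values)
 dat[!tolower(dat$values) %in% stop_words$word, ]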
Onyambu