
I'm attempting to perform sentiment analysis based on http://tidytextmining.com/sentiment.html#the-sentiments-dataset. Before running the sentiment analysis I need to convert my dataset into a tidy format.

My dataset is of the form:

x <- c( "test1" , "test2")
y <- c( "this is test text1" , "this is test text2")
res <- data.frame( "url" = x, "text" = y)
res
    url               text
1 test1 this is test text1
2 test2 this is test text2

To get one observation per row I need to process the text column and add new columns containing each word and the number of times it appears for that url. The same url will then appear in multiple rows.

Here is my attempt:

library(tidyverse)

x <- c( "test1" , "test2")
y <- c( "this is test text1" , "this is test text2")
res <- data.frame( "url" = x, "text" = y)
res

res_1 <- data.frame(res$text)
res_2 <- as_tibble(res_1)
res_2 %>% count(res.text, sort = TRUE) 

which returns (note that the url column has been dropped):

# A tibble: 2 x 2
            res.text     n
              <fctr> <int>
1 this is test text1     1
2 this is test text2     1

How can I count the words in res$text while keeping the url column, so that I can perform sentiment analysis?

Update:

x <- c( "test1" , "test2")
y <- c( "this is test text1" , "this is test text2")
res <- data.frame( "url" = x, "text" = y)
res

res %>%
  group_by(url) %>%
  transform(text = strsplit(text, " ", fixed = TRUE)) %>%
  unnest() %>%
  count(url, text)

which returns the error:

Error in strsplit(text, " ", fixed = TRUE) : non-character argument

I'm attempting to convert to a tibble because this appears to be the format required for tidytext sentiment analysis: http://tidytextmining.com/sentiment.html#the-sentiments-dataset
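I suspect the error occurs because text is stored as a factor (data.frame() defaulted to stringsAsFactors = TRUE before R 4.0), which is also what the comment below assumes. A rough sketch of that guess, creating the column as character so strsplit() gets a character vector:

library(tidyverse)

x <- c( "test1" , "test2")
y <- c( "this is test text1" , "this is test text2")
# keep text as character so strsplit() receives a character vector, not a factor
res <- data.frame( "url" = x, "text" = y, stringsAsFactors = FALSE)

res %>%
  transform(text = strsplit(text, " ", fixed = TRUE)) %>%
  unnest() %>%
  count(url, text)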

  • Why do you need to convert it to a tibble? In other words, your title doesn't seem to represent the actual question. It seems you just want a word count per url. I think one possible tidyverse approach could be `res %>% group_by(url) %>% transform(text = strsplit(text, " ", fixed = TRUE)) %>% unnest() %>% count(url, text)` (assuming `text` is a string and not a factor) – David Arenburg Dec 02 '17 at 23:49
  • @DavidArenburg please see update – blue-sky Dec 03 '17 at 00:40

1 Answer


Are you looking for something like this? When you do sentiment analysis with the tidytext package, you need to separate the words in each character string with unnest_tokens(). The function can do more than split text into words, so have a look at its documentation when you get a chance. Once you have one word per row, you can count how many times each word appears in each text with count(). Then you want to remove stop words; the tidytext package provides that data, so you can call it directly. Finally, you need sentiment information. Here I chose AFINN, but you can choose another lexicon if you prefer. I hope this helps.

x <- c( "text1" , "text2")
y <- c( "I am very happy and feeling great." , "I am very sad and feeling low")
res <- data.frame( "url" = x, "text" = y, stringsAsFactors = F)

#    url                               text
#1 text1 I am very happy and feeling great.
#2 text2      I am very sad and feeling low

library(tidytext)
library(dplyr)

data(stop_words)
afinn <- get_sentiments("afinn")

unnest_tokens(res, input = text, output = word) %>%
  count(url, word) %>%
  filter(!word %in% stop_words$word) %>%
  inner_join(afinn, by = "word")

#    url    word     n score
#  <chr>   <chr> <int> <int>
#1 text1 feeling     1     1
#2 text1   happy     1     3
#3 text2 feeling     1     1
#4 text2     sad     1    -2
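
Not part of the original answer, but as a possible follow-up: assuming the score column returned by get_sentiments("afinn") in this tidytext version, you could collapse this to a single sentiment value per url by weighting each word's score by its count:

unnest_tokens(res, input = text, output = word) %>%
  count(url, word) %>%
  filter(!word %in% stop_words$word) %>%
  inner_join(afinn, by = "word") %>%
  group_by(url) %>%
  summarise(sentiment = sum(score * n))

#     url sentiment
#   <chr>     <int>
#1  text1         4
#2  text2        -1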