
I am performing a sentiment analysis in R, and I was wondering how to split the wordcloud into two parts, highlighting positive and negative words. I am quite new to R and the online solutions didn't help me. This is the code:

text <- readLines("product1.txt")

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

docs <- Corpus(VectorSource(text))

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("don", "s", "t")) 
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

And this is the result I would like to achieve:

[image: desired result, a word cloud split into positive and negative words]

Thanks to everyone who will help me.

EDIT:

docs <- structure(list(content = c("This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.", 
"Great for beginner or experienced person. Bought as a gift and she loves it.", 
"Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already.", 
"I have had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands."
), meta = structure(list(language = "en"), class = "CorpusMeta"), 
    dmeta = structure(list(), .Names = character(0), row.names = c(NA, 
    6L), class = "data.frame")), class = c("SimpleCorpus", "Corpus"
))

2 Answers


As seen in the tutorial, to get such a result you need a lexicon, i.e. a "dictionary" that tells you whether a word is positive or negative. With that information you can use it to color your wordcloud.
Let's walk through the example from the link:

library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)   # needed for unnest_tokens()

# tidy up the corpus (all the Jane Austen books): clean it up and
# end with a tibble that has one word per row
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

As stated, you need a lexicon. The link discusses various lexicons; in this case it uses the bing one:

get_sentiments("bing")
# A tibble: 6,788 x 2
   word        sentiment
   <chr>       <chr>    
 1 2-faced     negative 
 2 2-faces     negative 
 3 a+          positive 
 4 abnormal    negative 
 5 abolish     negative 
 6 abominable  negative 
 7 abominably  negative 
 8 abominate   negative 
 9 abomination negative 
10 abort       negative 
# ... with 6,778 more rows
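
As an aside, the tutorial also mentions other lexicons you can peek at the same way. Depending on your tidytext version, the two below may prompt a one-time download through the textdata package:

get_sentiments("afinn")  # numeric scores from -5 (very negative) to +5 (very positive)
get_sentiments("nrc")    # words tagged with emotions such as joy, anger, fear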

Now, joining every word of tidy_books (the corpus) with bing (the lexicon), we can assign a positive or negative label to each word:

library(wordcloud)
library(reshape2)

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

And you'll have the desired output. Clearly, you will have to adapt this to your data, which I do not have.

[image: comparison.cloud output for the Jane Austen corpus, split into negative and positive words]

EDIT:

Adapted to your case, we can do this:

# take all the phrases
docs1 <- tibble(phrases = docs$content)

# add an id, from 1 to n
docs1$ID <- row.names(docs1)

# split all the words
tidy_docs <- docs1 %>% unnest_tokens(word, phrases)

# now create the cloud: expect a couple of warnings, because there are
# hardly any negative words and the join is done by "word" (which is correct)
tidy_docs %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

[image: comparison word cloud of the review data]
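
One caveat, also raised in the comments below: with so few words, the biggest one can trigger a warning such as disappointed could not be fit on page. It will not be plotted. A minimal fix, purely as an illustration, is to shrink the size range through the scale argument of comparison.cloud:

tidy_docs %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   scale = c(2, 0.5),  # smaller maximum size than the default c(4, 0.5)
                   max.words = 100)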

  • hi s_t, thanks a lot for replying. I have also seen that tutorial. I took the image from it. However, I found some issues in adapting it to my case. It gives me errors like: Error in UseMethod("inner_join") : no applicable method for 'inner_join' applied to an object of class "c('SimpleCorpus', 'Corpus')". I am using "docs" as a corpus. – mrpls Sep 04 '18 at 11:48
  • Hi @mrpls, from where have you take the docs data? Are yours or from a package? If so, which package? – s__ Sep 04 '18 at 11:53
  • It's mine, but it is a super simple .txt file with some products reviews. Thanks for your help. – mrpls Sep 04 '18 at 12:00
  • @mrpls If it's not a problem, you can `dput(head(docs))` or a part of it, to copy paste in my R to detect problem. If it's a problem, you should post some fake data similar to yours to copy paste and recreate your problem (and add in the question editing it). – s__ Sep 04 '18 at 12:11
  • I edited the original question. Hopefully it is understandable. As you can see it came from a very simple text file with some reviews. – mrpls Sep 04 '18 at 12:29
  • @mrpls edited with your data, hope it's the correct answer! – s__ Sep 04 '18 at 13:04
  • @s_t Something is not right. You should expect `disappointed` in the word cloud. – jazzurro Sep 04 '18 at 13:07
  • @s_t Applied to the whole dataset it works better than expected! great job! – mrpls Sep 04 '18 at 13:17
  • @jazzurro yes there is a warning that I forget, `disappointed could not be fit on page. It will not be plotted`. Thanks a lot to point it out. – s__ Sep 04 '18 at 13:18
  • @mrpls great, have a good time with text mining, it's very cool to do! – s__ Sep 04 '18 at 13:19

Consider this approach.
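
A quick note before the code: flipTextAnalysis is not on CRAN. Assuming it is Displayr's open-source package of the same name (the GitHub path below is an assumption, please verify), you would install it roughly like this:

# install from GitHub; repository path assumed to be Displayr/flipTextAnalysis
library(remotes)
install_github("Displayr/flipTextAnalysis")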

library(flipTextAnalysis)

# 'input.phrases' is the raw text to analyze; in your case this would be
# the character vector of reviews, e.g. the result of readLines("product1.txt")
text.to.analyze <- input.phrases

# Converting the text to a character vector
text.to.analyze <- as.character(text.to.analyze)

# Extracting the words from the text
options <- GetTextAnalysisOptions(phrases = '',
                                  extra.stopwords.text = 'amp',
                                  replacements.text = '',
                                  do.stem = TRUE,
                                  do.spell = TRUE)
text.analysis.setup <- InitializeWordBag(text.to.analyze,
                                         min.frequency = 5.0,
                                         operations = options$operations,
                                         manual.replacements = options$replacement.matrix,
                                         stoplist = options$stopwords,
                                         alphabetical.sort = FALSE,
                                         phrases = options$phrases,
                                         print.type = "frequencies")

# Sentiment analysis of the phrases 
phrase.sentiment = SaveNetSentimentScores(text.to.analyze, check.simple.suffixes = TRUE, blanks.as.missing = TRUE) 
phrase.sentiment[phrase.sentiment >= 1] = 1
phrase.sentiment[phrase.sentiment <= -1] = -1

# Sentiment analysis of the words
td <- as.matrix(AsTermMatrix(text.analysis.setup, min.frequency = 1.0, sparse = TRUE))
counts <- text.analysis.setup$final.counts 
phrase.word.sentiment <- sweep(td, 1, phrase.sentiment, "*")
phrase.word.sentiment[td == 0] <- NA # words that do not occur in a phrase are set to NA
word.mean <- apply(phrase.word.sentiment,2, FUN = mean, na.rm = TRUE)
word.sd <- apply(phrase.word.sentiment,2, FUN = sd, na.rm = TRUE)
word.n <- apply(!is.na(phrase.word.sentiment),2, FUN = sum, na.rm = TRUE)
word.se <- word.sd / sqrt(word.n)
word.z <- word.mean / word.se
word.z[word.n <= 3 | is.na(word.se)] <- 0 # vectorised '|' so every word is checked
words <- text.analysis.setup$final.tokens
x <- data.frame(word = words, 
      freq = counts, 
      "Sentiment" = word.mean,
      "Z-Score" = word.z,
      Length = nchar(words))
word.data <- x[order(counts, decreasing = TRUE), ]

# Working out the colors
n = nrow(word.data)
colors = rep("grey", n)
colors[word.data$Z.Score < -1.96] = "Red"
colors[word.data$Z.Score > 1.96] =  "Green"

# Creating the word cloud
library(wordcloud2)
wordcloud2(data = word.data[, -3], color = colors, size = 0.4)

[image: wordcloud2 output with positive words in green, negative words in red and the rest in grey]

I really don't like Trump, but this illustrates the point nicely.

Also, see the two links below for additional ideas on how to handle these kinds of problems.

http://rstudio-pubs-static.s3.amazonaws.com/71296_3f3ee76e8ef34410a1635926f740c473.html

https://www.analyticsvidhya.com/blog/2017/03/measuring-audience-sentiments-about-movies-using-twitter-and-text-analytics/
