
I am trying to score the sentiment of a dataset of Tweets using the AFINN dictionary (get_sentiments("afinn")). A sample of the dataset is provided below:

A tibble: 10 x 2
   Date                TweetText                                                
   <dttm>              <chr>                                                    
 1 2018-02-10 21:58:19 "RT @RealSirTomJones: Still got the moves! That was a lo~
 2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
 3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual   
 4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
 5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
 6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK                
 7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk              
 8 2018-02-10 21:58:21 RT @tonysheps: Sir Thomas Jones you’re a bloody legend #~
 9 2018-02-10 21:58:22 @ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT @RealSirTomJones: Still got the moves! That was a lo~

What I want to do is:

1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the scores of all the words in each Tweet.
4. Return this sum in a new third column, so I can see the score per Tweet.
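In base R terms, the steps above can be sketched like this, with a toy AFINN-style lookup (the words and scores below are invented for illustration; the real lexicon comes from get_sentiments("afinn")):

```r
# Toy AFINN-style lexicon: a named numeric vector of word scores
# (values invented for illustration only)
afinn_toy <- c(amazing = 4, legend = 2, good = 3, grief = -2, wonder = 1)

tweet <- "Good grief Tom Jones is amazing"

# 1. Split the Tweet into individual words
words <- unlist(strsplit(tolower(tweet), " "))

# 2. Score each word; words not in the lexicon come back as NA
word_scores <- afinn_toy[words]

# 3. Sum the scores, ignoring unmatched words
tweet_score <- sum(word_scores, na.rm = TRUE)
tweet_score
# 5
```

Step 4 would then assign one such sum per Tweet into a new column.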

For a similar lexicon I found the following code:

# Initialise the results vector
scoreTopic <- 0
# Loop over the documents in the corpus
for (i in seq_along(myCorpus)) {
  # Store separate words in a character vector
  terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
  # Count the positive matches
  pos_matches <- sum(terms %in% positive_words)
  # Count the negative matches
  neg_matches <- sum(terms %in% negative_words)
  # Store the difference in the results vector
  scoreTopic[i] <- pos_matches - neg_matches
} # End of the for loop

dsMyTweets$score <- scoreTopic

I am, however, not able to adjust this code to work with the AFINN dictionary, which assigns each word a numeric score rather than sorting words into positive and negative lists.

Tim
  • Reading chapter 2 and chapter 7 of [tidytextmining](https://www.tidytextmining.com) should give you all the info you need. – phiver May 06 '18 at 14:45

2 Answers


This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).

library(tidytext)
library(tidyverse)

tweets <- tribble(
    ~tweetID, ~TweetText,
    1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
    2, "@darinself OMG, this is my favorite.",
    3, "@treycausey @ftrain THIS IS AMAZING.",
    4, "@nest No, no, not in error. Just the turkey!",
    5, "The @nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")

Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package lets you use a special tokenizer for tweets), and inner_join() implements the sentiment analysis by keeping only the words that appear in the AFINN lexicon, along with their scores.

tweet_sentiment <- tweets %>%
    unnest_tokens(word, TweetText, token = "tweets") %>%
    inner_join(get_sentiments("afinn"))
#> Joining, by = "word"

Now we can find the score for each tweet. Take the original data set of tweets and left_join() onto it the sum() of the scores for each tweet. Tweets with no lexicon words are missing from the join, so the handy replace_na() function from tidyr lets you replace the resulting NA values with zero.

tweets %>%
    left_join(tweet_sentiment %>%
                  group_by(tweetID) %>%
                  summarise(score = sum(score))) %>%
    replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#>   tweetID TweetText                                                  score
#>     <dbl> <chr>                                                      <dbl>
#> 1      1. Was Julie helping me because I don't know anything about …    4.
#> 2      2. @darinself OMG, this is my favorite.                          2.
#> 3      3. @treycausey @ftrain THIS IS AMAZING.                          4.
#> 4      4. @nest No, no, not in error. Just the turkey!                 -4.
#> 5      5. The @nest people should write a blog post about how many …    0.

Created on 2018-05-09 by the reprex package (v0.2.0).

If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.

Julia Silge
0

For future reference:

Score_word <- function(x) {
  # Look up the AFINN score(s) for a single word
  word_bool_vec <- get_sentiments("afinn")$word == x
  score <- get_sentiments("afinn")$score[word_bool_vec]
  return(score)
}

Score_tweet <- function(sentence) {
  # Split the sentence into words, score each, and sum the results
  words <- unlist(strsplit(sentence, " "))
  scores <- unlist(sapply(words, Score_word))
  return(sum(scores))
}

dsMyTweets$score <- sapply(dsMyTweets$TweetText, Score_tweet)

This executes what I initially wanted! :)
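One caveat: Score_word calls get_sentiments("afinn") and rescans the whole lexicon for every single word, which gets slow on a large dataset. Caching the lexicon once as a named vector makes the lookup much faster. A sketch, using a tiny invented lexicon standing in for the real AFINN table:

```r
# Build the lookup once: names are words, values are their scores.
# (Mini-lexicon invented for illustration; in practice build it from
# the word and score columns of get_sentiments("afinn").)
afinn_lookup <- c(favorite = 2, amazing = 4, error = -2, legend = 2)

Score_tweet_fast <- function(sentence) {
  # Named-vector indexing returns NA for words not in the lexicon
  words <- unlist(strsplit(tolower(sentence), " "))
  sum(afinn_lookup[words], na.rm = TRUE)
}

Score_tweet_fast("what a legend and amazing")
# 6
```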

Tim