I saw a very nice R script that computes a sentiment score for each sentence, available at sentiment.R, and I was wondering how I could replace this part

# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)

so that it matches multi-word terms against positive and negative dictionaries that also contain multi-word terms. I have an example below.

I have the following data.frame:

sent <- data.frame(words = c("just right size", "love this quality", 
                         "good quality", "very good quality", "i hate this notebook",
                         "great improvement", "notebook is not good","notebook was"), user = c(1,2,3,4,5,6,7,8))

                 words user
1      just right size    1
2    love this quality    2
3         good quality    3
4    very good quality    4
5 i hate this notebook    5
6    great improvement    6
7 notebook is not good    7
8         notebook was    8

Then I have dictionaries of positive and negative words:

posWords <- c("great","improvement","love","great improvement","very good","good","right","very")
negWords <- c("hate","bad","not good","horrible")

Desired output is below:

                 words user  SentimentScore
1      just right size    1               1
2    love this quality    2               1
3         good quality    3               1
4    very good quality    4               1
5 i hate this notebook    5              -1
6    great improvement    6               1
7 notebook is not good    7              -1
8         notebook was    8               0

How should I rewrite the code on GitHub to get the desired output? If I use the GitHub source code as it is, then, for example, the 4th row gets 2 instead of 1 in the SentimentScore column.

Does anyone have any advice or a similar solution for this? I'd appreciate any help. Thank you very much in advance.
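To illustrate the problem, here is a minimal sketch. It assumes the script scores a sentence by counting how many of its split words occur in each dictionary; `scoreByWords` is a hypothetical helper written only for this illustration, and it uses base `strsplit` rather than `str_split`.

```r
posWords <- c("great", "improvement", "love", "great improvement",
              "very good", "good", "right", "very")
negWords <- c("hate", "bad", "not good", "horrible")

# Hypothetical helper: score a sentence one single word at a time,
# which is all that splitting on whitespace allows.
scoreByWords <- function(sentence) {
  words <- unlist(strsplit(sentence, "\\s+"))
  sum(words %in% posWords) - sum(words %in% negWords)
}

scoreByWords("very good quality")     # "very" and "good" both count: 2
scoreByWords("notebook is not good")  # only "good" can match: 1
```

After splitting, the multi-word entries "very good" and "not good" can never match, because no single word equals them.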

martinkabe

1 Answer

I didn't look at the library you mentioned. This may now be what you want. I created a data frame with the positive and negative words and assigned each a value of +1 or -1. I then added a length column to sort on, so that the longest word or phrase is matched first.

sent <- data.frame(words = c("just right size", "love this quality",
                             "good quality", "very good quality", "i hate this notebook",
                             "great improvement", "notebook is not good"), user = c(1,2,3,4,5,6,7),
                   stringsAsFactors = FALSE)

posWords <- c("great","improvement","love","great improvement","very good","good","right","very")
negWords <- c("hate","bad","not good","horrible")

# One lookup table: every term with its score (+1 positive, -1 negative)
wordsDF <- data.frame(words = posWords, value = 1, stringsAsFactors = FALSE)
wordsDF <- rbind(wordsDF, data.frame(words = negWords, value = -1, stringsAsFactors = FALSE))
# Sort by term length, longest first, so multi-word terms are tried before their parts
wordsDF$lengths <- nchar(wordsDF$words)
wordsDF <- wordsDF[order(-wordsDF$lengths), ]

scoreSentence <- function(sentence) {
  score <- 0
  for (x in seq_len(nrow(wordsDF))) {
    # fixed = TRUE matches the term literally, not as a regular expression
    if (grepl(wordsDF$words[x], sentence, fixed = TRUE)) {
      score <- score + wordsDF$value[x]
      # Remove the matched term so the shorter terms inside it are not counted again
      sentence <- sub(wordsDF$words[x], '', sentence, fixed = TRUE)
    }
  }
  score
}

SentimentScore <- unlist(lapply(sent$words, scoreSentence))
cbind(sent, SentimentScore)

Output

                 words user SentimentScore
1      just right size    1              1
2    love this quality    2              1
3         good quality    3              1
4    very good quality    4              1
5 i hate this notebook    5             -1
6    great improvement    6              1
7 notebook is not good    7             -1
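One caveat with this kind of matching: the terms are matched as substrings, so a term can also fire inside a longer word. If that matters for your data, a possible variation is to anchor each term at word boundaries. The sketch below uses a hypothetical helper, `asWholeWord`, and assumes the dictionary terms contain no regular-expression metacharacters.

```r
# Substring matching also fires inside longer words
grepl("right", "bright screen")                               # TRUE

# Hypothetical helper: wrap a term in word-boundary anchors.
# Assumes the term contains no regex metacharacters.
asWholeWord <- function(term) paste0("\\b", term, "\\b")

grepl(asWholeWord("right"), "bright screen", perl = TRUE)     # FALSE
grepl(asWholeWord("right"), "just right size", perl = TRUE)   # TRUE
grepl(asWholeWord("not good"), "notebook is not good", perl = TRUE)  # TRUE
```

The same anchored pattern could be passed to `sub` when removing a matched term.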
SethB
  • Thanks for your proposal, but in the 4th and 6th rows you have 2 instead of 1 in the SentimentScore column. – martinkabe Feb 16 '15 at 16:44
  • I definitely need to avoid strsplit(sentence, '\\s+'), because it splits the text into single words, so multi-word terms can no longer be matched. – martinkabe Feb 16 '15 at 16:45
  • I used grep and the word list against the sentence. I'm not sure what the rule is for multiple matches. I think you may want to remove the word/words once a match is found? – SethB Feb 16 '15 at 16:52
  • I mean, e.g. for sentence 4, "very good quality", the overall evaluation should come only from the multi-word term "very good" and not from "very" and "good" separately. So the SentimentScore will be 1 instead of 2. – martinkabe Feb 16 '15 at 16:54
  • I really don't know how it should work... I only need e.g. "is not good" to get SentimentScore = -1 instead of +1, which it gets now because it matches "good", a positive word :-) – martinkabe Feb 16 '15 at 17:02
  • This is closer. I think the `posWords` and `negWords` will have to be ordered by word length and traversed together. – SethB Feb 16 '15 at 17:18
  • Yes, exactly, you're right... pos/negWords must be ordered by word length. Once the algorithm matches the longest term, the shorter words inside it won't be examined... – martinkabe Feb 16 '15 at 17:24
  • And how could you traverse them together? I mean pos/negWords. – martinkabe Feb 16 '15 at 17:32
  • Seth, thank you very much, you are really great... Please, could I ask you for another, similar task? – martinkabe Feb 16 '15 at 18:42
  • When I run your approach for 100 rows, it takes a very long time :-( Is there any way to speed up your script? – martinkabe Feb 16 '15 at 19:53
  • I've tried it with 7000 pos/negWords, and for 100 sentences it took about 3 min. But I have 200,000 sentences :-) so it could take a very long time :-( – martinkabe Feb 16 '15 at 21:11
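On the speed question raised in the comments, one possible direction (a sketch, not benchmarked here) is to loop over the dictionary terms only once and let `grepl` and `sub` work on the whole vector of sentences at each step, instead of calling a function per sentence. `scoreSentences` and its arguments are hypothetical names; the sketch assumes plain-text terms, matched literally with `fixed = TRUE`, and applies the same longest-term-first rule.

```r
# Hypothetical vectorized scorer: one pass over the terms, all sentences at once
scoreSentences <- function(sentences, terms, values) {
  sentences <- as.character(sentences)
  # Longest terms first, so "very good" is consumed before "very" or "good"
  ord <- order(-nchar(terms))
  terms <- terms[ord]
  values <- values[ord]
  scores <- numeric(length(sentences))
  for (i in seq_along(terms)) {
    hit <- grepl(terms[i], sentences, fixed = TRUE)
    if (any(hit)) {
      scores[hit] <- scores[hit] + values[i]
      # Remove the matched term so its sub-terms are not counted again
      sentences[hit] <- sub(terms[i], "", sentences[hit], fixed = TRUE)
    }
  }
  scores
}

posWords <- c("great", "improvement", "love", "great improvement",
              "very good", "good", "right", "very")
negWords <- c("hate", "bad", "not good", "horrible")
sentences <- c("just right size", "love this quality", "good quality",
               "very good quality", "i hate this notebook",
               "great improvement", "notebook is not good", "notebook was")

scores <- scoreSentences(sentences,
                         terms  = c(posWords, negWords),
                         values = c(rep(1, length(posWords)),
                                    rep(-1, length(negWords))))
scores  # 1 1 1 1 -1 1 -1 0
```

The loop still runs once per term, but each iteration is a single vectorized call over all sentences rather than repeated data-frame indexing inside a per-sentence function.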