How to extract individual words from sentence and match them with words from pos and neg dictionaries in R

Question

I need to create a function in R that splits a sentence into words and then matches those words against dictionaries of positive and negative words. The result should be a sentiment score: 1 for each positive word in the sentence and -1 for each negative word.

Product_ID        Sentence        Attribute        SentimentScore
1111111              1            graphics                1
1111111              1            windows                 1
1111111              2            loads                  -1
2222222              1            laptops                -1
2222222              2            design                  1

The first sentence for product 1111111 might look like: ... this product... great graphics... works good on my windows.

E.g. the dictionary with positive words (pos.txt) looks like: a+ abound abounds abundance abundant accessable accessible acclaim acclaimed ... and so forth

and the dictionary with negative words (neg.txt) looks like: 2-faced 2-faces abnormal abolish abominable abominably abominate abomination abort aborted aborts ... and so forth
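
A minimal sketch of how such word lists could be loaded, assuming the files are plain text with entries separated by whitespace (spaces or newlines). `read_lexicon` is a hypothetical helper name, not something from the original post, and a temporary file stands in for pos.txt here:

```r
# Hypothetical helper (an assumption, not from the original post):
# read a whitespace-separated word list into a character vector.
read_lexicon <- function(path) {
  # scan() splits on any whitespace; quote = "" keeps apostrophes literal
  words <- scan(path, what = character(), quote = "", quiet = TRUE)
  unique(tolower(words))
}

# Temporary file standing in for pos.txt
tmp_pos <- tempfile(fileext = ".txt")
writeLines("a+ abound abounds abundance abundant", tmp_pos)

pos_words <- read_lexicon(tmp_pos)
print(pos_words)
```

Entries such as a+ contain regular-expression metacharacters, so comparing whole words (for example with %in%) is likely safer than pattern matching.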

I saw a function called score.sentiment on GitHub, but it evaluates the whole sentence using the difference between the number of positive and negative words in each sentence. I need something very similar, but for individual words.

I would really appreciate any help. Thanks a lot in advance.

Can you provide the sentences? This seems to be a tokenizing and matching task. — Tyler Rinker, Jan 21 '15 at 16:52
1st user: Good printer for the money. Wireless setup was surprisingly easy. — martinkabe, Jan 21 '15 at 16:59
2nd: Very good laptop! Worth the price too! Amazing and user friendly 3rd: This is a pretty decent laptop/tablet . The picture resolution is amazing! Good price for what you get. As good as iPad at better price. — martinkabe, Jan 21 '15 at 17:00

score 0 · Answer 1 · answered Jan 21 '15 at 16:56

Will this suit your needs?

pos = c("abound", "abounds", "abundant")
neg = c("2-face", "abnormal")

sent = "abundant abnormal activity was due to 2-face people"

# Count the positive dictionary entries that occur in the sentence
# (grepl matches substrings, case-insensitively)
p = 0
for (i in seq_along(pos)) {
  if (grepl(pos[i], sent, ignore.case = TRUE)) p = p + 1
}

# Count the negative dictionary entries the same way
n = 0
for (i in seq_along(neg)) {
  if (grepl(neg[i], sent, ignore.case = TRUE)) n = n + 1
}

print(p)
print(n)
print(paste("Overall sentence sentiment score = ", p - n))

Result: positive 1, negative 2, overall -1
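
Note that grepl treats each dictionary entry as a regular expression and matches substrings, so an entry like "abound" would also hit "abounds". A minimal sketch of a word-level alternative, assuming the sentence can be tokenized on whitespace and that exact, lower-cased matches are what is wanted (the variable names `score` and `hits` are illustrative):

```r
pos <- c("abound", "abounds", "abundant")
neg <- c("2-face", "abnormal")
sent <- "abundant abnormal activity was due to 2-face people"

# Split on whitespace and lower-case the tokens
words <- tolower(strsplit(sent, "\\s+")[[1]])

# +1 for tokens in the positive list, -1 for tokens in the negative list, 0 otherwise
score <- (words %in% pos) - (words %in% neg)

# Keep only the tokens that matched one of the dictionaries
hits <- data.frame(Word = words[score != 0],
                   SentimentScore = score[score != 0],
                   stringsAsFactors = FALSE)
print(hits)
```

This gives one row per matched word rather than one total per sentence.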

answered Jan 21 '15 at 16:56

Alexey Ferapontov


I need the same output as I showed in table above. – martinkabe Jan 21 '15 at 17:05
Split the sentence into individual words, match them against the words in the dictionaries, and print them with 1 for a positive word and -1 for a negative word. These values should go into another column. – martinkabe Jan 21 '15 at 17:07
`words = unlist(str_split(sent," ")) for (i in 1:length(words)) { for (j in 1:length(pos)){ if (words[i] == pos[j]) print(paste(words[i],1)) } for (k in 1:length(neg)){ if (words[i] == neg[k]) print(paste(words[i],-1)) } }` – Alexey Ferapontov Jan 21 '15 at 17:16
Result is: [1] "abundant 1" [1] "abnormal -1" [1] "2-face -1" – Alexey Ferapontov Jan 21 '15 at 17:18
If I have e.g. sent = data.frame(Sentences=c("abundant abnormal activity was due to 2-face people","abundant abnormal activity was due to 2-face people"), user = c(1,2)), could you split these into words, evaluate them and relate them to the particular user, please? I mean the grouping variable would be user. – martinkabe Jan 21 '15 at 17:25
Brute force would be to add another outer for statement and loop over user first, then over the words in the sentence for neg and pos for each user. There is probably a better and more elegant solution that doesn't need for loops and uses vectorized operations instead. – Alexey Ferapontov Jan 21 '15 at 17:30
How could I do that? Could you please provide a solution for that? – martinkabe Jan 21 '15 at 18:02

Alexey Ferapontov · Accepted Answer · 2015-01-21T19:38:08.667

Brute-force approach. It is not optimal, since it uses too many for loops, but it seems to do what you need. Hopefully this is suitable for your application. You can rearrange things or store the results in another variable so that the output comes without the [1] prefixes.

Code:

library(stringr)  # provides str_split

sent = data.frame(Sentences=c("abundant bad abnormal activity was due to 2-face people","strange exciting activity was due to 2-face people"), user = c(1,2), stringsAsFactors = FALSE)
pos = c("abound" , "abounds", "abundant", "exciting")
neg = c("2-face","abnormal", "strange", "bad", "weird")

# One character vector of words per sentence
words = str_split(sent$Sentences, " ")

tmp <- data.frame()  # accumulates one row per matched word
tmn <- data.frame()

for (i in 1:nrow(sent)) {
  # Loop over the words of sentence i, not over the number of sentences
  for (j in 1:length(words[[i]])) {
    for (k in 1:length(pos)){
      if (words[[i]][j] == pos[k]) {
        print(paste(i,words[[i]][j],1))
        tmn <- cbind(i,words[[i]][j],1)
        tmp <- rbind(tmp,tmn)
      }
    }
    for (m in 1:length(neg)){
      if (words[[i]][j] == neg[m]) {
        print(paste(i,words[[i]][j],-1))
        tmn <- cbind(i,words[[i]][j],-1)
        tmp <- rbind(tmp,tmn)
      }
    }
  }
}

print(tmp)

Result:

  i       V2 V3
1 1 abundant  1
2 1      bad -1
3 1 abnormal -1
4 1   2-face -1
5 2  strange -1
6 2 exciting  1
7 2   2-face -1
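
The same per-user table can probably be built without nested loops and without growing a data frame one row at a time. A minimal sketch using base R only, assuming whitespace-separated words and exact matches; `score_sentence` is a hypothetical helper name, not part of the answer:

```r
sent <- data.frame(
  Sentences = c("abundant bad abnormal activity was due to 2-face people",
                "strange exciting activity was due to 2-face people"),
  user = c(1, 2),
  stringsAsFactors = FALSE
)
pos <- c("abound", "abounds", "abundant", "exciting")
neg <- c("2-face", "abnormal", "strange", "bad", "weird")

# Hypothetical helper: score the words of one sentence for one user
score_sentence <- function(user, sentence) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  score <- (words %in% pos) - (words %in% neg)  # 1, -1 or 0 per word
  keep <- score != 0
  data.frame(user = rep(user, sum(keep)),
             word = words[keep],
             score = score[keep],
             stringsAsFactors = FALSE)
}

# One small data frame per sentence, bound together once at the end
result <- do.call(rbind, Map(score_sentence, sent$user, sent$Sentences))
rownames(result) <- NULL
print(result)
```

Binding once at the end avoids the repeated rbind calls inside the loops, which tends to matter on larger data sets.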

edited Jan 21 '15 at 19:38

answered Jan 21 '15 at 18:34

Alexey Ferapontov


That's great. That's exactly what I was looking for. How could I store the results in a data frame or an n x 3 matrix? – martinkabe Jan 21 '15 at 18:57
Please see above. This is a quickly invented solution written on a lunch break and is far from optimal, but it works. Be careful with large datasets, as it uses 3 nested for loops. It should be fine for small ones. – Alexey Ferapontov Jan 21 '15 at 19:39
That's perfect, thank you very much, you helped me a lot. I'll try to figure out how it works and then rewrite it for large data sets, because I need it for a big data implementation. – martinkabe Jan 21 '15 at 19:57
Something is wrong, because when I run your code, it skips some words in the dictionary: sent1 = data.frame(Sentences=c("abundant bad abnormal activity was due to 2-face people","strange exciting activity was due to great 2-face people"), user = c(1,2)) pos1 = c("abound" , "abounds", "abundant", "exciting", "great") neg1 = c("2-face","abnormal", "strange", "bad", "weird") It produces only: [1] "1 abundant 1" [1] "1 bad -1" [1] "2 strange -1" [1] "2 exciting 1" – martinkabe Jan 22 '15 at 12:17
Could you please write a better solution for that? – martinkabe Jan 22 '15 at 12:20
Can you post your exact code here? Maybe there's a typo somewhere. – Alexey Ferapontov Jan 22 '15 at 13:31
See below... that's what I tried. – martinkabe Jan 22 '15 at 14:59
So, my part of the code worked? Not sure I understood your comment – Alexey Ferapontov Jan 22 '15 at 15:05
No, unfortunately it doesn't. If you look at print(tmp), it seems to skip some of the words. – martinkabe Jan 22 '15 at 15:12
E.g. great and 2-face are missing. – martinkabe Jan 22 '15 at 15:14
Of course! My bad. This should be the correct line: `for (j in 1:length(words[[i]])) {` – Alexey Ferapontov Jan 22 '15 at 15:20
Yes, it works :-) thanks a lot. Do you happen to know how I can get one word to the left and one word to the right of the matched word? – martinkabe Jan 22 '15 at 15:35
Sure thing. If you split the words as above, then when you have a matched word, say words[[i]][2], ask for words[[i]][1] and words[[i]][3]. This will be per user i. – Alexey Ferapontov Jan 22 '15 at 15:53
Could you please add it to your source code? Let's say we want one word on the left and one word on the right side of the searched word. This is above my R programming skills. The best approach would be to write both words into separate columns next to the searched one. – martinkabe Jan 22 '15 at 15:58
tmn <- cbind(i,paste(words[[i]][j-1],words[[i]][j],words[[i]][j+1],sep=" "),1) – Alexey Ferapontov Jan 22 '15 at 16:04
Great work, thanks a lot, your job helps me a lot ! ! – martinkabe Jan 22 '15 at 16:26
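
A minimal sketch of the neighbour lookup discussed in these comments, written for a single sentence and with the left and right words in separate columns. It assumes that a match at the very start or end of the sentence should get NA for the missing neighbour; the column names are illustrative:

```r
words <- c("strange", "exciting", "activity", "was", "due", "to",
           "great", "2-face", "people")
pos <- c("abound", "abounds", "abundant", "exciting", "great")

# Positions of the matched (positive) words
idx <- which(words %in% pos)

# Pad the vector with NA so that out-of-range neighbours become NA
left  <- c(NA, words)[idx]       # element idx of the padded vector is words[idx - 1]
right <- c(words, NA)[idx + 1]   # words[idx + 1], or NA after the last word

context <- data.frame(left = left,
                      word = words[idx],
                      right = right,
                      score = rep(1, length(idx)),
                      stringsAsFactors = FALSE)
print(context)
```

The same idea would apply to the negative list with a score of -1.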

score 0 · Answer 3 · answered Jan 22 '15 at 14:58

sent1 = data.frame(Sentences=c("abundant bad abnormal activity was due to 2-face people","strange exciting activity was due to great 2-face people"), user = c(1,2)) 
pos1 = c("abound" , "abounds", "abundant", "exciting", "great")
neg1 = c("2-face","abnormal", "strange", "bad", "weird")

Then I used:

words = (str_split(unlist(sent1$Sentences)," "))

tmp <- data.frame()
tmn <- data.frame()

for (i in 1:nrow(sent1)) {
  for (j in 1:length(words)) {
    for (k in 1:length(pos1)){
      if (words[[i]][j] == pos1[k]) {
        print(paste(i,words[[i]][j],1))
        tmn <- cbind(i,words[[i]][j],1)
        tmp <- rbind(tmp,tmn)
      }
    }
    for (m in 1:length(neg1)){
      if (words[[i]][j] == neg1[m]) {
        print(paste(i,words[[i]][j],-1))
        tmn <- cbind(i,words[[i]][j],-1)
        tmp <- rbind(tmp,tmn)
      }
    }
  }
}

That resulted in:

print(tmp)
  i       V2 V3
1 1 abundant  1
2 1      bad -1
3 2  strange -1
4 2 exciting  1

If I do something like this instead:

sent1$Sentences <- as.character(sent1$Sentences)
List <- strsplit(sent1$Sentences, " ")
a <- data.frame(Id=rep(sent1$user, sapply(List, length)),    Words=unlist(List))
a$Words <- as.character(a$Words)
a[a$Words %in% pos1,]

it gives the positive words:

Id    Words
1 abundant
2 exciting
2    great

and the negative words, with a[a$Words %in% neg1,]:

Id    Words
1      bad
1 abnormal
1   2-face
2  strange
2   2-face

But I need to add the value 1 for positive words and -1 for negative words.
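
One way the score column could be added to this long-format table is a named lookup vector that maps each dictionary word to its score. This is a minimal sketch under the same assumptions as above (space-separated words, exact matches); `lexicon` and `scored` are illustrative names:

```r
sent1 <- data.frame(
  Sentences = c("abundant bad abnormal activity was due to 2-face people",
                "strange exciting activity was due to great 2-face people"),
  user = c(1, 2),
  stringsAsFactors = FALSE
)
pos1 <- c("abound", "abounds", "abundant", "exciting", "great")
neg1 <- c("2-face", "abnormal", "strange", "bad", "weird")

# One row per word, tagged with the user it came from
List <- strsplit(sent1$Sentences, " ")
a <- data.frame(Id = rep(sent1$user, lengths(List)),
                Words = unlist(List),
                stringsAsFactors = FALSE)

# Named lookup vector: dictionary word -> score
lexicon <- c(setNames(rep(1, length(pos1)), pos1),
             setNames(rep(-1, length(neg1)), neg1))

# Indexing by name returns NA for words that are in neither dictionary
a$SentimentScore <- unname(lexicon[a$Words])
scored <- a[!is.na(a$SentimentScore), ]
rownames(scored) <- NULL
print(scored)
```

Because positive and negative words end up in one table, the original word order within each sentence is preserved.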
