sentiment analysis with different number of documents

Question

I am trying to do sentiment analysis on newspaper articles and track the sentiment level across time. To do that, I identify all the relevant news articles within a day, feed them into the polarity() function, and obtain the average polarity score of all the articles (more precisely, the average over all the sentences from all the articles) within that day.

The problem is that on some days there are many more articles than on others, and I think simply tracking the daily average polarity score might mask some of the information. For example, a score of 0.1 from 30 news articles should carry more weight than a score of 0.1 generated from only 3 articles. Sure enough, some of the more extreme polarity scores I obtained came from days with only a few relevant articles.

Is there any way I can take the different number of articles each day into account?

library(qdap)
sentence = c("this is good","this is not good")
polarity(sentence)

You mean can you weight the polarity scores of each day by the number of articles that day? — lawyeR, Jan 21 '15 at 01:19
Yes, is that an acceptable practice? And if so, what is a good weighting factor? — Seamus Lam, Jan 21 '15 at 01:22
It may be useful to say where this `polarity` function comes from as well as give a minimal working example MWE: http://stackoverflow.com/help/mcve — Tyler Rinker, Jan 21 '15 at 01:23
added a simplified example to show where the function came from. — Seamus Lam, Jan 21 '15 at 01:30

Tyler Rinker · Accepted Answer · 2015-01-21T13:09:19.000

I would warn that sometimes saying something strong with few words may pack the most punch. Make sure what you're doing makes sense in terms of your data and research questions.

One approach would be to weight by the number of words, as in the following example (I prefer the first weighting, `weight`, over the second, `weight2`, here):

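library(qdap)

# Polarity by group; mraja1spl is a sample data set shipped with qdap
poldat2 <- with(mraja1spl, polarity(dialogue, list(sex, fam.aff, died)))

output <- scores(poldat2)
# Log-based weight: grows with word count but levels off; scaled so the max is 1
weight <- ((1 - (1/(1 + log(output[["total.words"]], base = exp(2))))) * 2) - 1
weight <- weight/max(weight)
# Linear weight: word count as a fraction of the largest group's word count
weight2 <- output[["total.words"]]/max(output[["total.words"]])

output[["weighted.polarity"]] <- output[["ave.polarity"]] * weight
output[["weighted.polarity2"]] <- output[["ave.polarity"]] * weight2
# Drop columns 5 and 6 for display
output[, -c(5:6)]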


##    sex&fam.aff&died total.sentences total.words ave.polarity weighted.polarity weighted.polarity2
## 1       f.cap.FALSE             158        1641        0.083       0.143583793        0.082504197
## 2        f.cap.TRUE              24         206        0.044       0.060969157        0.005564434
## 3       f.mont.TRUE               4          29        0.079       0.060996614        0.001397106
## 4       m.cap.FALSE              73         651        0.031       0.049163984        0.012191207
## 5        m.cap.TRUE              17         160       -0.176      -0.231357933       -0.017135804
## 6     m.escal.FALSE               9         170       -0.164      -0.218126656       -0.016977931
## 7      m.escal.TRUE              27         590       -0.067      -0.106080866       -0.024092720
## 8      m.mont.FALSE              70         868       -0.047      -0.078139272       -0.025099276
## 9       m.mont.TRUE             114        1175       -0.002      -0.003389105       -0.001433481
## 10     m.none.FALSE               7          71        0.066       0.072409049        0.002862997
## 11  none.none.FALSE               5          16       -0.300      -0.147087026       -0.002925046
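
The same idea can be applied to article counts per day rather than word counts per group. What follows is only a minimal sketch in base R, not the method used above: the `daily` data frame, its column names, and all of its values are made up for illustration, and `log1p()` is swapped in as a simpler dampening function than the log formula in the example.

```r
# Hypothetical daily summary: one row per day with that day's mean polarity
# and the number of relevant articles (all values are invented).
daily <- data.frame(
  date         = as.Date("2015-01-01") + 0:3,
  n_articles   = c(30, 3, 12, 1),
  ave_polarity = c(0.10, 0.10, -0.05, -0.40)
)

# Linear weight: proportional to the article count, scaled so the busiest day is 1.
daily$w_linear <- daily$n_articles / max(daily$n_articles)

# Dampened weight: grows with the count but flattens out, so very busy days
# do not dominate. log1p(n) = log(1 + n) is positive for every n >= 1.
daily$w_log <- log1p(daily$n_articles) / max(log1p(daily$n_articles))

daily$weighted_linear <- daily$ave_polarity * daily$w_linear
daily$weighted_log    <- daily$ave_polarity * daily$w_log

# Pooled polarity for the whole period, weighting each day by its article count.
pooled <- weighted.mean(daily$ave_polarity, w = daily$n_articles)

print(daily)
print(pooled)
```

With the linear weight, the 3-article day is scaled down to a tenth of the 30-article day even though both have the same average polarity; the dampened weight shrinks it less aggressively. Which of the two is appropriate depends on the data and the research question.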
