The program being developed takes text as input and outputs a vector representation (the document) with sentences as rows and words as columns, where each word is given a numerical value depending on its sentiment. Function words (like "the", "was", "were") will be given the value 0.01. Behind the program there is a database in which words have numerical values depending on their polarity (positive or negative). This database gives each word a prior polarity, which may change depending on its contextual polarity. The problem to be solved is what range of numerical values should be given to the words in the database.
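
A minimal sketch of the described output, under stated assumptions: the names `FUNCTION_WORDS`, `PRIOR_POLARITY`, `score_word` and `score_document` are hypothetical, the polarity values are made up for illustration, unknown words are assumed to score 0.0, and contextual polarity is not modelled at all.

```python
FUNCTION_WORD_VALUE = 0.01
FUNCTION_WORDS = {"the", "was", "were"}      # examples taken from the question
PRIOR_POLARITY = {"good": 0.8, "bad": -0.8}  # made-up prior polarities (assumption)


def score_word(word):
    """Return 0.01 for function words, else the prior polarity (0.0 if unknown)."""
    if word in FUNCTION_WORDS:
        return FUNCTION_WORD_VALUE
    return PRIOR_POLARITY.get(word, 0.0)


def score_document(text):
    """Return one row per sentence and one value per word in that sentence."""
    rows = []
    for sentence in text.split("."):
        words = sentence.lower().split()
        if words:  # skip the empty piece after the final full stop
            rows.append([score_word(w) for w in words])
    return rows


print(score_document("The food was good. The service was bad."))
# [[0.01, 0.0, 0.01, 0.8], [0.01, 0.0, 0.01, -0.8]]
```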

  • I think you should play a little with your dataset and manually fine-tune your algorithm to find this range. – Leo Mar 07 '14 at 12:01
  • I agree with Leo: choose an initial set based on gut feeling and start fine-tuning as actual data comes in. – nablex Mar 07 '14 at 12:43

2 Answers

A crude way to think about it is to estimate the maximum number of words you may have and the smallest difference between two numerical values that you want to distinguish. For example, with a range from -1 to 1 and a resolution of 0.01, you get (1 - (-1)) / 0.01 = 2 / 0.01 = 200 steps, so room for roughly 200 words if each word needs its own value. I hope you get the point.

So, to hold a collection of 1000 positive words and 500 negative words at a numerical resolution of 0.01, your range has to be -(500 * 0.01) to (1000 * 0.01), that is, -5 to 10.
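
The same arithmetic as a small sketch. The function name `value_range` and its parameters are hypothetical, and it assumes every word needs its own distinct value, spaced one resolution step apart. `Fraction` is used only so the result is not affected by floating-point rounding.

```python
from fractions import Fraction


def value_range(n_positive, n_negative, resolution="0.01"):
    """Return (lowest, highest) so that each word can get a distinct value.

    Assumes one resolution step per word on each side of zero.
    """
    step = Fraction(resolution)  # exact 1/100 rather than a binary float
    return (-n_negative * step, n_positive * step)


low, high = value_range(1000, 500)
print(float(low), float(high))  # -5.0 10.0
```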

I hope that I have understood your question properly.

A word of caution: when using double/float, remember that numerical computing uses finite precision. For example, 0.01 is not stored as exactly 0.01, so you should not rely on == for comparisons in your code. Use >= or <= instead (or compare within a small tolerance); you may sometimes have to tweak your logic to achieve this.
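
A short illustration of that caution, using the well-known 0.1 + 0.2 case. The helper `same_score` and its default tolerance are assumptions made for this sketch, not part of any library; `math.isclose` is the standard-library function it relies on.

```python
import math


def same_score(a, b, tol=1e-9):
    """Treat two scores as equal if they differ by at most tol (assumed tolerance)."""
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)


a = 0.1 + 0.2              # mathematically 0.3
print(a == 0.3)            # False: the sum is stored as 0.30000000000000004
print(same_score(a, 0.3))  # True
```

The tolerance should be much smaller than the resolution chosen for the word values (0.01 in the example above), so that genuinely different scores are never merged.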

Sourabh Bhat

If you have already set the basic words to 0.01, why not just give the other words a point value based on their length? The hard part would be getting rid of all the common words.

crychair