
Objective: Classify each tweet as positive or negative and write the username, the original tweet and the tweet's sentiment to an output file.

Code:

import re, math

input_file = "raw_data.csv"
fileout = open("Output.txt", "w")
wordFile = open("words.txt", "w")
# Strips @mentions, other non-alphanumeric characters, and URLs
expression = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

fileAFINN = 'AFINN-111.txt'
# AFINN has one "word<TAB>score" entry per line
afinn = dict((w, int(s)) for w, s in (ws.strip().split('\t') for ws in open(fileAFINN)))

pattern_split = re.compile(r"\W+")
print("File processing started")
with open(input_file, 'r') as myfile:
    for line in myfile:
        line = line.lower()
        line = re.sub(expression, " ", line)
        # Drop the empty strings that split() leaves at the ends of the line
        words = [w for w in pattern_split.split(line) if w]
        sentiments = [afinn.get(word, 0) for word in words]
        # How should you weight the individual word sentiments?
        # You could do N, sqrt(N) or 1 for example. Here I use sqrt(N).
        # The result is a float: positive values are positive valence,
        # negative values are negative valence.
        if sentiments:
            sentiment = float(sum(sentiments)) / math.sqrt(len(sentiments))
        else:
            sentiment = 0
        wordFile.write(line + ',' + str(sentiment) + '\n')
        fileout.write(line + '\n')
print("File processing completed")

# myfile is closed automatically by the with block
fileout.close()
wordFile.close()
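
To make the weighting comment in the code concrete, here is a minimal Python 3 sketch that compares the three divisors it mentions (N, sqrt(N) and 1). The function name `weighted_sentiment` and the sample scores are made up for this illustration and are not part of the script above.

```python
import math


def weighted_sentiment(scores, weighting="sqrt"):
    """Combine per-word scores; `weighting` selects the divisor: N, sqrt(N) or 1."""
    if not scores:
        return 0.0
    n = len(scores)
    divisors = {"n": float(n), "sqrt": math.sqrt(n), "one": 1.0}
    return sum(scores) / divisors[weighting]


# Made-up per-word scores for a four-word tweet (illustrative only)
scores = [3, 0, 0, -1]
for name in ("n", "sqrt", "one"):
    print(name, weighted_sentiment(scores, name))
```

Dividing by N gives the average score per word, dividing by 1 gives the raw sum (which grows with tweet length), and sqrt(N) sits between the two.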

Issue: The output.txt file currently looks like this:

abc some tweet text 0
bcd some more tweets 1
efg some more tweet 0

Question 1: How do I put a comma between the user id, the tweet text and the sentiment? The output should look like this:

 abc,some tweet text,0
 bcd,some other tweet,1
 efg,more tweets,0

Question 2: The tweets are in Bahasa Melayu (BM), but the AFINN dictionary I am using contains only English words, so the classification is wrong. Do you know of any BM dictionary that I can use?

Question 3: How do I pack this code in a JAR file?

Thank you.

mnm
  • Can you please provide some more information? You are giving us the output of `sentiments.txt` but none of your code writes to `sentiments.txt` so I am not sure what format you expect it to be in. Additionally you should have a `wordFile.close` at the end of your code. – Kristy Hughes Aug 05 '15 at 09:06
  • @Kristy Hughes Thanks for pointing out the anomaly. I have updated the original post. The file sentiments.txt has been replaced with output.txt and have closed the wordFile at the end. – mnm Aug 05 '15 at 09:31
  • Maybe you can use smileys to create a training set of tweets and then implement a Naive Bayes Classifier? – clemtoy Aug 16 '15 at 14:28
  • @clemtoy that is a real good suggestion. I had it in mind since the inception of this idea. But you see the fundamental problem is I'm still new to the programming paradigm as well as data mining algorithms. That is why I chose to break this problem into sub-parts and chose to solve them one at a time which would thus contribute to my learning. – mnm Aug 17 '15 at 00:43

1 Answer

Question 1:

output.txt currently just contains the lines you read in, because of fileout.write(line+'\n'). Since each line is space separated, you can split it into its parts quite easily:

line_data = line.split() # Split the line into a list on whitespace (this also drops the trailing newline)
user_id = line_data[0] # The first element of the list
tweets = line_data[1:-1] # The middle elements of the list
sentiment = line_data[-1] # The last element of the list
fileout.write(user_id + "," + " ".join(tweets) + "," + sentiment + '\n')
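
If a tweet can itself contain a comma, joining the fields by hand produces rows that are ambiguous to read back. As a possible alternative, here is a small Python 3 sketch using the standard `csv` module, which quotes such fields automatically. The helper name `format_row` and the sample rows are assumptions made for this example.

```python
import csv
import io


def format_row(user_id, tweet, sentiment):
    """Return one comma-separated line, quoting fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([user_id, tweet, sentiment])
    return buffer.getvalue()


# Sample rows, made up for illustration
print(format_row("abc", "some tweet text", 0), end="")
print(format_row("bcd", "hello, world", 1), end="")
```

In a real script you would probably pass the output file straight to `csv.writer(...)` and call `writerow` once per tweet instead of building each line in memory.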

Question 2: A quick Google search turned up this. I'm not sure it has everything you need, though: https://archive.org/stream/grammardictionar02craw/grammardictionar02craw_djvu.txt
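
Whichever word list you end up with, it would need sentiment scores before it could replace AFINN. Assuming you can get it into the same tab-separated "word, score" layout that AFINN-111.txt uses, a loader might look like the sketch below. The function name `load_lexicon` is hypothetical, and the two sample entries are invented for illustration only; they are not taken from any published BM lexicon.

```python
def load_lexicon(lines):
    """Parse 'word<TAB>score' lines (the AFINN layout) into a dict."""
    lexicon = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        word, score = raw.rsplit("\t", 1)
        lexicon[word.lower()] = int(score)
    return lexicon


# Hypothetical entries, invented for illustration; not from a real lexicon
sample_lines = ["bagus\t3", "buruk\t-3", ""]
lexicon = load_lexicon(sample_lines)
print(lexicon)
```

Because the function accepts any iterable of lines, an open file object can be passed in the same way as the sample list.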

Question 3: Try Jython: http://www.jython.org/archive/21/docs/jythonc.html

Kristy Hughes
  • Thanks for your response. In your code, `fileout.write(user_id + "," + tweets + "," + sentiment +'\n')` throws `TypeError: cannot concatenate 'str' and 'list' objects`, which I corrected to `wordFile.write(str(user_id) + ',' + str(tweets) + ',' + str(sentiment) +'\n')`. The output of executing your code then looks like `mysuara,1mdb,['jawab', 'tuduhan', 'tun', 'm', 'isu', 'dana', "'hilang'"],0.0`, which is not what I want. – mnm Aug 07 '15 at 02:01
  • Right, whoops forgot that `tweets` was a list. Use `.join()` instead of `str()`. The syntax of join is `separator.join(list)`. So since you want it space separated, you want `" ".join(tweets)`. Updated my answer to reflect this. – Kristy Hughes Aug 08 '15 at 01:44