Recently, I have been working on a project that requires sentiment analysis of Twitter data. I am using a Naive Bayes Classifier from the TextBlob library, and am trying to train it with 1.6 million tweets (which can be found here if anyone is wondering: https://www.kaggle.com/kazanova/sentiment140). Just outright passing in the 1.6 million tweets causes a MemoryError, so I decided to chunk it so only 1,000 tweets get trained at a time. This had only minor success: I can only get to about 10,000 tweets on my local machine before my computer freezes up because I am using too much RAM. I then tried it on Google Colab so I could run my code in the cloud. With both the TPU and the GPU runtimes, the furthest I have gotten is about 28,000 tweets before the session crashed and I had to restart the runtime. Here is my code:
with open("shuffledlist.pickle", 'rb') as f: #Loading in my list of 1.6 million tweets
full_data = pickle.load(f)
training_data = (tweet for tweet in full_data[:1500000])
try:
with open("sentimentclassifier.pickle", "rb") as file: #makes a new classifier if one doesnt exist
classifier = pickle.load(file)
print("Got existing classifier")
except EOFError:
classifier = NaiveBayesClassifier(full_data[:1000])
print("Made new classifier")
del full_data
feeding_size = 1000
left_splice = 0
right_splice = feeding_size + left_splice
count = 0
new_start_time = time.time()
past_times = 0
while right_splice < 1500000:
loop_time = time.time()
data = itertools.islice(training_data,left_splice,right_splice)
try:
classifier.update(data)
except Exception:
print("Houston we got a problem")
with open("sentimentclassifier.pickle", "wb") as sentiment:
pickle.dump(classifier, sentiment, protocol = -1)
sys.exit("Yo it ended at {} and {}".format(left_splice, right_splice))
past_times += time.time() - loop_time
count += 1
string = "Left: {} Right: {}. Took {} seconds. Total Time Elapsed: {}. Average Time for each: {}. Count: {}."\
.format(left_splice, right_splice, time.time()-loop_time, time.time() - new_start_time, past_times/count, count)
sys.stdout.write('\r' + string)
left_splice += feeding_size
right_splice += feeding_size
with open("sentimentclassifier.pickle", "wb") as sentiment:
pickle.dump(classifier, sentiment, protocol = -1)
print("Done dumping cycle {}!".format(count))
print("Done! Right: {}, Left: {}!".format(left_splice, right_splice))
with open("sentimentclassifier.pickle", "wb") as sentiment:
pickle.dump(classifier, sentiment, protocol = -1)
print("Training took {} seconds!".format(time.time()-new_start_time))
Some notes:
Since my primary problem is how big the sentimentclassifier.pickle file gets, I tried using gzip, but it takes far too long to open and close the file. This is especially bad because I open the file every loop, since I do not want to lose any progress if the program crashes.
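For reference, here is a minimal sketch of what the gzip approach could look like; the lowered compresslevel and the SAVE_EVERY checkpoint interval are assumptions on my part to cut the open/close overhead, not what I originally ran:

import gzip
import pickle

SAVE_EVERY = 25  # made-up interval: checkpoint every 25 chunks instead of every loop

def save_classifier(classifier, path="sentimentclassifier.pickle.gz"):
    # compresslevel=1 trades file size for much faster writes than the default of 9
    with gzip.open(path, "wb", compresslevel=1) as f:
        pickle.dump(classifier, f, protocol=-1)

# inside the training loop:
#     if count % SAVE_EVERY == 0:
#         save_classifier(classifier)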
I switched from using lists to using generators, which did improve the speed quite significantly.
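To show what I mean, here is a minimal sketch of the two versions, using a stand-in list of (text, label) pairs rather than my real data:

full_data = [("example tweet text", "pos")] * 10  # stand-in for the real list of (text, label) pairs

# list version: materializes every element up front
training_list = [tweet for tweet in full_data]

# generator version: produces one tweet at a time, so only the current chunk
# ever has to be turned into a list by the training loop
training_gen = (tweet for tweet in full_data)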
In Google Colab I tried passing in 10,000 tweets at a time as something of a last-ditch effort, and unsurprisingly it did not work out.
I am not sure whether nltk's Naive Bayes Classifier is more efficient, but I would rather keep that as a last resort, since reformatting my list of tweets may take a few hours. That said, if it really is more efficient I will happily redo my code if it means I can get this working.
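In case it helps to see what that reformatting would involve, here is a minimal sketch of the nltk-style data layout, assuming my data is (tweet_text, label) pairs as TextBlob expects; word_features is a made-up bag-of-words extractor, not something from my actual code:

import nltk

def word_features(text):
    # hypothetical bag-of-words extractor: nltk's classifier wants a dict of
    # features per example instead of the raw tweet text
    return {word: True for word in text.split()}

# stand-in for the real list of (tweet_text, label) pairs
full_data = [("I love this", "pos"), ("I hate this", "neg")]

nltk_ready = [(word_features(text), label) for text, label in full_data]
classifier = nltk.NaiveBayesClassifier.train(nltk_ready)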
Thank you and any advice will be greatly appreciated!