Recently I have been working on a project which requires sentiment analysis of Twitter data. I am using the Naive Bayes classifier from the TextBlob library and am trying to train it with 1.6 million tweets (which can be found here if anyone is wondering: https://www.kaggle.com/kazanova/sentiment140). Passing in all 1.6 million tweets at once causes a MemoryError, so I decided to chunk the data so only 1000 tweets get trained at a time. This has had limited success: I can only get to about 10,000 tweets on my local machine before my computer freezes up because I am using too much RAM. I then tried it on Google Colab so I could run my code in the cloud. With both the TPU and the GPU runtimes, the furthest I have gotten is 28,000 tweets before the session crashed and I had to restart the runtime. Here is my code:

with open("shuffledlist.pickle", 'rb') as f: #Loading in my list of 1.6 million tweets
    full_data = pickle.load(f)

training_data = (tweet for tweet in full_data[:1500000]) 

try:
    with open("sentimentclassifier.pickle", "rb") as file: #makes a new classifier if one doesnt exist
        classifier = pickle.load(file)
        print("Got existing classifier")
except (FileNotFoundError, EOFError):
    classifier = NaiveBayesClassifier(full_data[:1000])
    print("Made new classifier")
del full_data

feeding_size = 1000
left_splice = 0
right_splice = feeding_size + left_splice

count = 0
new_start_time = time.time()
past_times = 0

while right_splice < 1500000:
    loop_time = time.time()
    data = itertools.islice(training_data, left_splice, right_splice)
    try:
        classifier.update(data)
    except Exception:
        print("Houston we got a problem")
        with open("sentimentclassifier.pickle", "wb") as sentiment:
            pickle.dump(classifier, sentiment, protocol = -1)
        sys.exit("Yo it ended at {} and {}".format(left_splice, right_splice))
    past_times += time.time() - loop_time
    count += 1
    string = "Left: {} Right: {}. Took {} seconds. Total Time Elapsed: {}. Average Time for each: {}. Count: {}."\
        .format(left_splice, right_splice, time.time()-loop_time, time.time() - new_start_time, past_times/count, count)
    sys.stdout.write('\r' + string)
    left_splice += feeding_size
    right_splice += feeding_size
    with open("sentimentclassifier.pickle", "wb") as sentiment:
        pickle.dump(classifier, sentiment, protocol = -1)
        print("Done dumping cycle {}!".format(count))

print("Done! Right: {}, Left: {}!".format(left_splice, right_splice))

with open("sentimentclassifier.pickle", "wb") as sentiment:
    pickle.dump(classifier, sentiment, protocol = -1)


print("Training took {} seconds!".format(time.time()-new_start_time))

Some notes:

  • Since my primary problem is how big my sentimentclassifier.pickle file gets, I have tried using gzip, but it just takes way too long to open and close the file. This is especially bad because I need to open the file every loop, since I do not want to lose any progress if the program crashes. (A rough sketch of the gzip version is shown after these notes.)

  • I switched from using lists to using generators, which did improve the speed quite significantly.

  • In Google Colab I tried passing in 10,000 at a time, which was sort of a last-ditch effort, and unsurprisingly, it did not work out for the best.

  • I am not sure if nltk's Naive Bayes Classifier is more efficient, but I really want that to be a last resort, as reformatting my list of tweets may take a few hours. But if it really is more efficient I will happily redo my code if it means I can get this working.
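
For reference, the gzip attempt from the first note looked roughly like this (the .gz filename and the helper functions are just for illustration). The compressed pickle is smaller on disk, but gzipping and un-gzipping the whole classifier on every loop is what made it too slow:

import gzip
import pickle

def save_classifier(classifier, path="sentimentclassifier.pickle.gz"):
    # Compressed checkpoint: smaller file, but every save re-compresses everything
    with gzip.open(path, "wb") as f:
        pickle.dump(classifier, f, protocol = -1)

def load_classifier(path="sentimentclassifier.pickle.gz"):
    with gzip.open(path, "rb") as f:
        return pickle.load(f)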

Thank you and any advice will be greatly appreciated!

1 Answer


I'd try splitting the training data into manageable chunks in separate training files. Instead of opening a file of 1.5 million tweets, split that file into a few thousand tweets per file. Make a loop to train on one file, close that file, then train on the next. That way you're not loading so much into RAM at once. The way you're doing it now, the entire 1.5 million tweets has to be held in memory, and that's what is bogging down your RAM.
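
Something like this for the one-time splitting step (the chunk size and file names here are just placeholders, tune them to whatever your RAM can handle):

import pickle

CHUNK_SIZE = 5000  # a few thousand tweets per file

# One-time preprocessing: you still load the big list once here,
# but the training runs never have to touch it again
with open("shuffledlist.pickle", "rb") as f:
    full_data = pickle.load(f)

for i in range(0, len(full_data), CHUNK_SIZE):
    with open("tweets_chunk_{}.pickle".format(i // CHUNK_SIZE), "wb") as out:
        pickle.dump(full_data[i:i + CHUNK_SIZE], out, protocol = -1)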

To be clear:

This:

with open("shuffledlist.pickle", 'rb') as f: #Loading in my list of 1.6 million tweets full_data = pickle.load(f)

    >training_data = (tweet for tweet in full_data[:1500000]) 

is loading the entire file into RAM. So chunking after this line isn't lessening the amount of RAM you're using, though it does decrease how many GPU cores you're utilizing, because you're only feeding the GPU x amount per iteration. If you load a smaller file in the first place, you'll decrease the amount of RAM utilized from the get-go.
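
Then the training loop only ever holds one small chunk plus the classifier in RAM, something along these lines (again, the file names and checkpointing frequency are just one way to do it):

import glob
import pickle

with open("sentimentclassifier.pickle", "rb") as f:
    classifier = pickle.load(f)

for path in sorted(glob.glob("tweets_chunk_*.pickle")):
    with open(path, "rb") as f:
        chunk = pickle.load(f)  # only this chunk is in memory
    classifier.update(chunk)    # same update() call you're already using
    del chunk
    with open("sentimentclassifier.pickle", "wb") as out:  # checkpoint after each file
        pickle.dump(classifier, out, protocol = -1)
    print("Finished {}".format(path))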

  • It seems like the main thing bogging down my RAM is the classifier itself, as after about 28,000 tweets, it becomes 3 GB. But I will try this idea as it won't hurt. I'll try it out when I am home and update you. – KoderKaran Feb 16 '20 at 17:11
  • The classifier is on your hard disk. There's no need to load the whole file into ram just to update it with each run. Trying to put it simply here... imagine you have a text file as a list of algorithms for a script to run. Like, old-style machine learning. You create rules to derive new algorithms, then you want to update the algorithm text file. Open("algorithms","a") as algorithms: isn't going to load the entire text file. It's just going to open it and then when you write to it, you just add to what's in there. 5 gigs or 5kb, it doesn't matter. – Pete Marise Feb 16 '20 at 17:23
  • Here's a link to what you're trying to do: https://medium.com/@mrgarg.rajat/training-on-large-datasets-that-dont-fit-in-memory-in-keras-60a974785d71 – Pete Marise Feb 16 '20 at 17:24
  • Ohh so I thought my classifier was taking up the RAM but it is actually my dataset? And chunking the dataset itself should fix the problem? I just have one question about this. Why is it then that my program isn't slow with the first 1000 tweets, but it freezes my computer after 10,000? – KoderKaran Feb 16 '20 at 17:58
  • If left slice stays at zero, then the data set it's loading into the GPU is increasing with each iteration. Maybe I just need more coffee... I don't know... but I don't see any change happening to left slice, while right slice gets higher and higher with each run. You could fix that by increasing left slice by feed size with each run, but ultimately, it's kinder to your RAM and your processor to just chunk the training data and run in smaller batches. – Pete Marise Feb 16 '20 at 18:13
  • My left splice also increases every loop, by the same amount that the right splice increases by. – KoderKaran Feb 16 '20 at 18:16
  • Doesn't the fact that I keep adding to my classifier by training it and I am holding it memory mean that the classifier is the problem? – KoderKaran Feb 16 '20 at 18:19
  • Where are you holding the classifier? You're assigning a variable that points to the disk file and adding to it with .update(). Where does it update left slice in this code? Itertools? Could be a combination of holding the entire training set in ram and then also adding to ram with each run by overtaxing the GPU. I'm not sure why it gets progressively slower, unless it's the left slice thing or the ram being borrowed by the processor :/ – Pete Marise Feb 16 '20 at 18:24
  • I change left splice in the line right above right splice, but it is pretty hard to spot lol. I get my classifier by loading it in and setting the variable "classifier" to it. It looks like this: `classifier = pickle.load(file)`, where file is the pickled classifier from previous training. I am using itertools because apparently it is much more memory efficient than lists. I definitely think you are right about the way I am holding my training set being a part of the problem though. – KoderKaran Feb 16 '20 at 18:33
  • Okay... far as I understand pickle, it's not loading the data itself into the ram. .load() deserializes the "pickled" data so you can read or write to the actual pickled file that's stored on disk. Near as I can tell, that shouldn't cause an issue. If I'm missing something, you could just create a new pickle for each run, then combine the pickles at the end of the program. That way, if it IS the classifier, you can classify in smaller batches as well. But yeah, I mean... yeah that makes sense that the classifier is taking up a lot of processing RAM as it gets larger – Pete Marise Feb 16 '20 at 18:41
  • Thanks! I am gonna try out your suggestions later tonight when I get home. Will update here! Thanks again for the time and help! – KoderKaran Feb 16 '20 at 18:45