I have a training dataset of 1,600,000 tweets. How can I train on such a huge dataset?
I have tried nltk.NaiveBayesClassifier, but it would take more than 5 days to finish training if I ran it.
import nltk

def extract_features(tweet):
    # For every word in the global featureList, record whether it occurs in this tweet
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

# tweets is my list of (word_list, label) pairs
training_set = nltk.classify.util.apply_features(extract_features, tweets)
NBClassifier = nltk.NaiveBayesClassifier.train(training_set)  # This takes lots of time
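What I am considering instead is building one sparse bag-of-words matrix up front rather than one feature dict per tweet. Below is a rough sketch of that idea using scikit-learn (the file name 'tweets.csv' and the column names are just placeholders for my actual CSV); I am not sure whether this is the right way to go:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# 'tweets.csv' and the column names are placeholders for my actual file
df = pd.read_csv('tweets.csv', names=['label', 'tweet'])

# One sparse document-term matrix instead of a feature dict per tweet
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(df['tweet'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print(accuracy_score(y_test, nb.predict(X_test)))

I do not know whether MultinomialNB on binary counts behaves the same as nltk.NaiveBayesClassifier, so corrections are welcome.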
What should I do?
I need to classify my dataset using both SVM and Naive Bayes.
Dataset I want to use: Link
Sample (training dataset):
Label  Tweet
0      url aww bummer you shoulda got david carr third day
4      thankyou for your reply are you coming england again anytime soon

Sample (testing dataset):
Label  Tweet
4      love lebron url
0      lebron beast but still cheering the til the end
I only have to predict label 0 or 4.
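For the SVM part, this is roughly what I had in mind (again just a sketch with placeholder file names, assuming scikit-learn is acceptable; I picked LinearSVC because I read that kernel SVMs are too slow at this scale, but I am not sure):

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# 'training.csv' / 'testing.csv' and the column names are placeholders
train_df = pd.read_csv('training.csv', names=['label', 'tweet'])
test_df = pd.read_csv('testing.csv', names=['label', 'tweet'])

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_df['tweet'], train_df['label'])
predicted = svm.predict(test_df['tweet'])  # should give 0 or 4 for each test tweet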
How can I train on this huge dataset efficiently?