
I have a training dataset of 1,600,000 tweets. How can I train on this much data?

I have tried nltk.NaiveBayesClassifier, but it would take more than 5 days to train if I ran it to completion.

def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    # One boolean feature per word in the global featureList.
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features


training_set = nltk.classify.util.apply_features(extract_features, tweets)

NBClassifier = nltk.NaiveBayesClassifier.train(training_set)  # This takes lots of time  

What should I do?

I need to classify my dataset using SVM and Naive Bayes.

Dataset I want to use: Link

Sample (training dataset):

Label     Tweet
0         url aww bummer you shoulda got david carr third day
4         thankyou for your reply are you coming england again anytime soon

Sample (testing dataset):

Label     Tweet
4         love lebron url
0         lebron beast but still cheering the til the end
I have to predict the Label (0 or 4) only.

How can I train on this huge dataset efficiently?

Shahriar
  • Use `scikit-learn` and try out `pandas`. 1.6 million tweets is not that much, given that the vocabulary would be ~1 million words. Also remove singletons. – alvas Jan 14 '15 at 23:05
  • @alvas, any specific tutorial site? – Shahriar Jan 14 '15 at 23:06
  • Take a look at http://scikit-learn.org/stable/tutorial/ and https://github.com/EducationalTestingService/skll and http://pandas.pydata.org/ – alvas Jan 14 '15 at 23:07
  • You might also try [dimension](http://ufldl.stanford.edu/wiki/index.php/PCA) [reduction](http://ufldl.stanford.edu/wiki/index.php/Whitening) to capture some high percentage of the variance of the data. Not sure how well it works for large, sparse feature vectors like these, though (a sparse-friendly sketch appears after these comments). – senderle Jan 14 '15 at 23:33
  • Could you post the data up somewhere on Google Drive or something? Then possibly we can try and find a solution for you. – alvas Jan 15 '15 at 10:41
  • @alvas, sorry for the delay. http://goo.gl/oqStKr is the link to the dataset; I also provided it in my post. Thank you. – Shahriar Jan 27 '15 at 12:08
  • What classes are you trying to predict? Is it related to the number at the beginning of the tweet? – James Pringle Jan 30 '15 at 19:11
  • Yes, the number at the beginning of the tweet is the category. @JamesPringle – Shahriar Jan 30 '15 at 19:41
  • I am curious, then, why does `training.csv` have only two categories (0 and 4), while `testing.csv` has three categories (0, 2, and 4)? Seems to me that from training it would be impossible to produce a 2 as a prediction. – James Pringle Jan 30 '15 at 20:26
  • 2 is for neutral, to check how many neutral tweets get classified as positive or negative. Actually, I will ignore it. – Shahriar Jan 30 '15 at 23:14
  • Do you have to use Naive Bayes, or does it not matter as long as the trained model is accurate enough? – runDOSrun Jan 31 '15 at 15:23
  • Can you please edit your post, properly explaining what you want to predict based on what? Looking at your csv I'm not able to immediately understand your features and labels. – runDOSrun Jan 31 '15 at 15:32
  • @runDOSrun, I edited my post, thanks. SVM would actually be perfect for me. – Shahriar Jan 31 '15 at 16:48
  • Another question: Does it really matter how long it trains? Do you want to keep training your model on a daily basis with new data automatically or just *once* and then be done with it? – runDOSrun Feb 01 '15 at 13:51
  • And: I am very unsure about your split between training and test data. The test set has 500 samples and the training set more than a million. This seems like teaching someone astrophysics only to ask them "1+1=?" afterwards. How did you arrive at these sizes? – runDOSrun Feb 01 '15 at 13:55
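On the dimension-reduction comment above: plain PCA needs dense input, but scikit-learn's TruncatedSVD (commonly used for latent semantic analysis) accepts large sparse matrices directly. A minimal sketch; the variable `tweets` and the choice of 200 components are assumptions of mine, not from the thread:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# tweets is assumed to be a list of tweet strings.
X = TfidfVectorizer().fit_transform(tweets)  # large, sparse tf-idf matrix

# TruncatedSVD works directly on sparse input, unlike plain PCA,
# and maps the huge vocabulary space down to a few hundred dense dimensions.
svd = TruncatedSVD(n_components=200)
X_reduced = svd.fit_transform(X)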

3 Answers


Before speeding up the training, I'd personally make sure that you actually need to. While not a direct answer to your question, I'll try to provide a different angle which you might or might not be missing (it's hard to tell from your initial post).

Take, e.g., superbly's implementation as a baseline: 1.6 million training samples and 500 test samples with 3 features yields 0.35 accuracy.

Using the exact same setup, you can go as low as 50k training samples without losing accuracy; in fact, accuracy goes up slightly, probably because you are overfitting with that many examples (you can check this by running his code with a smaller sample size). I'm pretty sure that using a neural network at this stage would give horrible accuracy with this setup (the SVM can be tuned somewhat to overcome overfitting, though that's not my point).

You wrote in your initial post that you have 55k features (which you deleted for some reason?). This number should correlate with your training set size. Since you didn't specify your list of features, it's not really possible to give you a proper working model or to test my assumption.

However, I highly suggest that you reduce your training data as a first step and see a) how well you perform and b) at which point overfitting sets in. I would also make the test set considerably larger; 500 vs. 1.6 million is an odd split. Try 80/20% for train/test. As a third step, check your feature list size: is it representative of what you need? If there are unnecessary or duplicate features in that list, you should consider pruning them.
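A minimal sketch of that check, assuming the tweets and labels are already loaded into Python lists (the variable names, subset sizes, and the choice of LinearSVC are mine, and the imports use current scikit-learn paths):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# tweets: list of tweet strings; labels: list of 0/4 labels (assumed loaded).
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=0)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Grow the training subset and watch where test accuracy stops improving;
# past that point, more data buys you nothing.
for n in (5000, 10000, 50000, 100000, len(y_train)):
    clf = LinearSVC()
    clf.fit(X_train_vec[:n], y_train[:n])
    acc = accuracy_score(y_test, clf.predict(X_test_vec))
    print('train size %d -> test accuracy %.3f' % (n, acc))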

As a final thought: if you come back to longer training runs (e.g. because you decide that you do in fact need much more data than provided now), consider whether slow learning really is an issue (besides testing your model). Many state-of-the-art classifiers are trained for days or weeks using GPU computing. Training time doesn't matter in that case because they're only trained once and possibly only updated with small batches of data when they "go online".
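On the "small batches" point: scikit-learn's out-of-core tools support exactly this kind of incremental training, and they would also sidestep holding 1.6 million tweets in memory at once. A hedged sketch; the chunk size, hashing dimension, and two-column file layout are my assumptions:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer needs no fitted vocabulary, so it works one chunk at a time.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(loss='hinge')  # hinge loss gives a linear SVM

# Stream the 1.6M-row file in chunks instead of loading it all at once.
for chunk in pd.read_csv('train.csv', header=None,
                         names=['label', 'tweet'], chunksize=100000):
    chunk = chunk[chunk['tweet'].notnull()]
    X = vectorizer.transform(chunk['tweet'])
    clf.partial_fit(X, chunk['label'], classes=[0, 4])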

runDOSrun
  • What is the accuracy if you ignore test tweets with label 2? – Shahriar Feb 01 '15 at 15:11
  • Thank you, I will try splitting 80/20% for train/test and will let you know. – Shahriar Feb 01 '15 at 15:24
  • If I do that, it goes up from 0.36 to 0.5 (test size 369, train 50k, 3 features, SVM, classes 0 and 4 split 50/50). With a training size of 6k it's still 0.5, indicating the problem I talked about. You should also definitely "test" on your training data to see at which point you reach 100% or the error converges, and stop training at exactly that point, as any more training will produce the same or worse results. – runDOSrun Feb 01 '15 at 15:25
  • How did you select these 3 features? I thought all unique words would be features. – Shahriar Feb 01 '15 at 15:32
  • Could you please tell me what the accuracy would be if you split the training dataset 80/20% into training and test sets? – Shahriar Feb 01 '15 at 15:35
  • I just randomly took superbly's 3 features to have a baseline. All unique words *can* be features, but it depends on what exactly you're trying to accomplish and thus which algorithm you choose to train. Using scikit-learn you can use a bag-of-words sparse representation: http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction – runDOSrun Feb 01 '15 at 15:43
  • This is the new dataset link: http://goo.gl/VRttrT . I used 40k training and 10k testing tweets, and 2k features, but accuracy is 0.50. That's low, I think. – Shahriar Feb 01 '15 at 16:24
  • Yes, 0.5 is random guessing and as such quite bad. You need to find out where the error on the *training set* converges. The test accuracy might very well go from 0 to 1 and from 1 back to 0.5 or 0. Unless you test the parameters extensively, it's hard to tell whether your 0.5 is "on the way up to 1.0" or "on the way down from 1.0". If it's on the way down, you're training too much. – runDOSrun Feb 01 '15 at 16:38
  • How can I check whether it's "on the way up to 1.0" or "on the way down from 1.0"? I have no idea how to do this, or how to stop before it starts going down from 1. – Shahriar Feb 01 '15 at 16:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/70002/discussion-between-aerofoil-kite-and-rundosrun). – Shahriar Feb 01 '15 at 16:55

Following what superbly proposed about feature extraction, you could use the TfidfVectorizer in the scikit-learn library to extract the important words from the tweets. Using the default configuration, coupled with a simple LogisticRegression, it gives me 0.8 accuracy. Hope that helps. Here is an example of how to use it for your problem:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# The csv files have no header row, so name the columns explicitly.
train_df_raw = pd.read_csv('train.csv', header=None, names=['label', 'tweet'])
test_df_raw = pd.read_csv('test.csv', header=None, names=['label', 'tweet'])
train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['label'] != 2]  # drop the neutral tweets

# Binarize the labels: 0 stays 0, 4 (positive) becomes 1.
y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
X_train = train_df_raw['tweet'].tolist()
X_test = test_df_raw['tweet'].tolist()

print('At vectorizer')
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
print('At vectorizer for test data')
X_test = vectorizer.transform(X_test)

print('At classifier')
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))

# Use a new name so the imported confusion_matrix function is not shadowed.
conf_mat = confusion_matrix(y_test, predictions)
print(conf_mat)

Accuracy: 0.8
[[135  42]
 [ 30 153]]
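Since the question asks for an SVM, the classifier above can be swapped for a linear SVM with a small change; a sketch (the resulting accuracy may of course differ from the 0.8 reported above):

from sklearn.svm import LinearSVC

# Same pipeline as above, with a linear SVM in place of logistic regression.
classifier = LinearSVC()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)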
farmi

I have an option here. It took 3 minutes on my machine (I should really get a new one :P).

MacBook (2006)
2 GHz Intel Core 2 Duo
2 GB DDR2 SDRAM

The achieved accuracy was: 0.355421686747

I'm sure you can get better results if you tune the support vector machine.

First I changed the format of the CSV files so they can be imported more easily. I just replaced the first whitespace in each line with a comma, which can then be used as the delimiter during import.

cat testing.csv | sed 's/\ /,/' > test.csv
cat training.csv | sed 's/\ /,/' > train.csv

In Python I used pandas to read the CSV files and a list comprehension to extract the features, which is much faster than for loops. Afterwards I used sklearn to train a support vector machine.

import pandas
from sklearn import svm
from sklearn.metrics import accuracy_score

featureList = ['obama', 'usa', 'bieber']

# The converted files have no header row, so name the columns explicitly.
train_df = pandas.read_csv('train.csv', sep=',', header=None,
                           names=['label', 'tweet'],
                           dtype={'label': int, 'tweet': str})
test_df = pandas.read_csv('test.csv', sep=',', header=None,
                          names=['label', 'tweet'],
                          dtype={'label': int, 'tweet': str})

# One boolean feature per word: does the tweet contain it?
train_features = [[w in str(tweet) for w in featureList] for tweet in train_df.values[:, 1]]
test_features = [[w in str(tweet) for w in featureList] for tweet in test_df.values[:, 1]]
train_labels = train_df.values[:, 0]
test_labels = test_df.values[:, 0]

clf = svm.SVC(max_iter=1000)
clf.fit(train_features, train_labels)
prediction = clf.predict(test_features)

print('accuracy:', accuracy_score(test_labels.tolist(), prediction.tolist()))
mjspier
  • This is helpful. I need to adjust my training and testing dataset – Shahriar Feb 01 '15 at 15:25
  • `featureList = ['obama','usa','bieber']`: why these three features? I tried all unique words, but that gives a memory error. Any techniques? – Shahriar Feb 01 '15 at 15:36
  • The three features were just a guess on my part, for testing; I saw that these three words occur in some tweets. I thought you had your own list. If you want to use all unique words, I think this implementation will not work. Mostly you don't want to use all unique words anyway, as many words may be present in only one tweet. It might be good to use the words which occur most often (one way to do this is sketched after these comments). It is also not so clear to me what you want to predict. – mjspier Feb 01 '15 at 16:08
  • Is it possible to get 80% accuracy somehow? – Shahriar Feb 01 '15 at 16:25
  • I agree with superbly. Using all words might be more data than needed (as said in my answer, you might need to prune the feature list). It's hard to tell really because you never told us what exactly you're trying to predict with this data. I think you need to formulate your problem properly before any more numbers are crunched by someone else than you. – runDOSrun Feb 01 '15 at 16:41