
I have a CSV file with 3,483 lines (about 460K characters and 65K words), and I'm trying to use this corpus to train a Naive Bayes classifier via TextBlob's NaiveBayesClassifier.

The problem is that the statement below takes too long (it ran for over an hour without finishing):

from textblob.classifiers import NaiveBayesClassifier

# Build the classifier directly from the CSV file of text,label rows
with open('train.csv', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

Any guesses as to what I'm doing wrong?

Thanks in advance.

Flavio
  • Is your CSV file formatted like so: http://textblob.readthedocs.io/en/dev/classifiers.html – vendaTrout Feb 12 '17 at 14:27
  • Yes @vendaTrout. This is an example of the file:
    ```
    instagrama,INSTAGRAM
    #fb,FACEBOOK
    facebookio,FACEBOOK
    facebooktime messenger iphone,FACEBOOK
    whatsapp com,WHATSSUP
    facebooko #fb,FACEBOOK
    facebookiokio #fb,FACEBOOK
    instagramas: ,INSTAGRAM
    facebook https:fb,FACEBOOK
    facebook #fb,FACEBOOK
    ```
    – Flavio Feb 12 '17 at 14:47
  • Assuming each training example and its label are on their own line (separated by "\n"), can you profile the function on a smaller CSV, or on this one? Please have a look at the stdlib [profiling](https://docs.python.org/3/library/profile.html) module (see the sketch after these comments). – vendaTrout Feb 12 '17 at 14:57
  • I made a small CSV with 200 lines and it takes 3 minutes to load. How can I profile this? – Flavio Feb 12 '17 at 15:13
  • I am also facing this issue with no luck. Is there any alternative for the same task? – Vineet May 31 '18 at 04:36
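A minimal profiling sketch along the lines suggested in the comments, assuming a small sample file named `train_small.csv` in the same two-column format (the file name is just an illustrative assumption); the standard library's `cProfile`/`pstats` show where the time goes:

import cProfile
import pstats

from textblob.classifiers import NaiveBayesClassifier

def build_classifier(path):
    # Train TextBlob's Naive Bayes classifier from a CSV of text,label rows
    with open(path, 'r') as fp:
        return NaiveBayesClassifier(fp, format="csv")

# Profile the training call and dump the stats to a file
cProfile.run("build_classifier('train_small.csv')", "train.prof")

# Print the 20 entries with the highest cumulative time
pstats.Stats("train.prof").sort_stats("cumulative").print_stats(20)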

2 Answers


There is a known problem with this library.

It's documented in the following links:

https://github.com/sloria/TextBlob/pull/136

https://github.com/sloria/TextBlob/issues/77

The short story: the library does not handle large datasets well. A possible workaround is sketched below.
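As a hedged workaround sketch (not a fix for the library itself): you can read the CSV yourself into `(text, label)` tuples, which `NaiveBayesClassifier` also accepts, and cap the number of training rows so the run finishes; the sample size of 500 below is an arbitrary illustrative choice.

import csv
import random

from textblob.classifiers import NaiveBayesClassifier

# Load the CSV into (text, label) tuples (assumes exactly two columns per row)
with open('train.csv', 'r') as fp:
    rows = [(text, label) for text, label in csv.reader(fp)]

# Train on a random subset to keep the runtime manageable
random.seed(0)
sample = random.sample(rows, min(500, len(rows)))
cl = NaiveBayesClassifier(sample)

print(cl.classify("facebook #fb"))

This only keeps the training set small; it does not make the library itself any faster on the full corpus.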

Flavio

I am not entirely sure about the TextBlob library, but perhaps this may help:

I wrote the following code to train a multinomial Naive Bayes model on raw text data, after vectorizing and applying a TF-IDF transform to the text in my dataset.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Load the dataset: column 0 holds the labels, column 1 holds the text
path = "C:\\Users\\sidharth.m\\Desktop\\Project_sid_35352\\Final.csv"
documents = pd.read_csv(path)

array = documents.values
x = array[:, 1]   # raw text
y = array[:, 0]   # labels

# Convert the text to token counts, then to TF-IDF weights
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(x)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Fit a multinomial Naive Bayes model on the TF-IDF features
model = MultinomialNB().fit(X_train_tfidf, y)

# Evaluate on the training data (training accuracy only)
predicted = model.predict(X_train_tfidf)
acc = accuracy_score(y, predicted)
print(acc)
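One caveat: the accuracy above is measured on the same data the model was trained on, so it will be optimistic. Below is a minimal held-out-evaluation sketch, assuming the same `x` and `y` arrays loaded above; the 80/20 split and `random_state` value are arbitrary illustrative choices.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# x and y are the text and label arrays loaded in the snippet above
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the vectorizer, TF-IDF transform, and classifier on the training split only
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
pipeline.fit(x_train, y_train)

# Report accuracy on the held-out test split
print(accuracy_score(y_test, pipeline.predict(x_test)))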
SalazarSid