I have a corpus that looks like this:
X_train = [['this is an dummy example'],
           ['in reality this line is very long'],
           ...
           ['here is a last text in the training set']]
and some labels:
y_train = [1, 5, ..., 3]
I would like to use Pipeline and GridSearchCV as follows:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
grid_search.fit(X_train, y_train)
When I run this, I get an error saying AttributeError: lower not found.
I searched and found a question about this error here, which led me to believe that there was a problem with my text not being tokenized. That sounded like it hit the nail on the head, since I am using a list of lists as input data, where each inner list contains one single unbroken string.
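If that theory is right, then I assume (I have not verified this) that even a stripped-down call like the one below should fail the same way, because the vectorizer presumably tries to call lower() on each document, and here each "document" is a list rather than a string:

from sklearn.feature_extraction.text import CountVectorizer

# My guess at what is going wrong: each "document" handed to the vectorizer is a
# one-element list, not a string, so the default lowercasing step has nothing to
# call lower() on.
CountVectorizer().fit_transform([['this is an dummy example'],
                                 ['in reality this line is very long']])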
I cooked up a quick and dirty tokenizer to test this theory:
def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist
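For example, applied to a corpus shaped like mine, it produces lists of tokens:

my_tokenizer([['this is an dummy example']])
# -> [['this', 'is', 'an', 'dummy', 'example']]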
So it does what it is supposed to, but when I use it as the tokenizer argument to the CountVectorizer:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer)),
...I still get the same error as if nothing happened.
I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline, which is strange... I didn't think you could use the TfidfTransformer() without first having a data structure to transform, in this case the matrix of counts.
Why do I keep getting this error? Actually, it would be nice to know what this error even means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace.) Am I misusing the Pipeline, or is the problem really an issue with the arguments to the CountVectorizer alone?
Any advice would be greatly appreciated.