I have a corpus that looks like this:
X_train = [['this is an dummy example'],
           ['in reality this line is very long'],
           ...
           ['here is a last text in the training set']]
and some labels:
y_train = [1, 5, ..., 3]
I would like to use Pipeline and GridSearchCV as follows:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
grid_search.fit(X_train, y_train)
When I run this, I get an error saying AttributeError: lower not found.
I searched and found a question about this error here, which led me to believe that there was a problem with my text not being tokenized. That sounded like it hit the nail on the head, since I am using a list of lists as input data, where each inner list contains one single unbroken string.
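If that theory is right, then I assume (I have not verified this) that even a stripped-down call like the one below should fail the same way, because the vectorizer presumably tries to call lower() on each document, and here each "document" is a list rather than a string:

from sklearn.feature_extraction.text import CountVectorizer

# My guess at what is going wrong: each "document" handed to the vectorizer is a
# one-element list, not a string, so the default lowercasing step has nothing to
# call lower() on.
CountVectorizer().fit_transform([['this is an dummy example'],
                                 ['in reality this line is very long']])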
I cooked up a quick and dirty tokenizer to test this theory:
def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist
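For example, applied to a corpus shaped like mine, it produces lists of tokens:

my_tokenizer([['this is an dummy example']])
# -> [['this', 'is', 'an', 'dummy', 'example']]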
So it does what it is supposed to, but when I use it as the tokenizer argument to the CountVectorizer:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer)),
...I still get the same error as if nothing happened.
I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline, which is strange... I didn't think you could use the TfidfTransformer() without first having a data structure to transform, in this case the matrix of counts.
Why do I keep getting this error? Actually, it would be nice to know what this error even means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace.) Am I misusing the Pipeline, or is the problem really an issue with the arguments to the CountVectorizer alone?
Any advice would be greatly appreciated.