I am doing document classification and have obtained an accuracy of up to 76%. To predict a document's category, I did the following:

doc_clf.predict(tf_idf.transform((count_vect.transform([r'document']))))

and I get the following error:

File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/metaestimators.py", line 115, in <lambda>
  out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/sklearn/pipeline.py", line 306, in predict
  Xt = transform.transform(Xt)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 923, in transform
  _, X = self._count_vocab(raw_documents, fixed_vocab=True)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
  for feature in analyze(doc):
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
  tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
  return lambda x: strip_accents(x.lower())
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 647, in __getattr__
  raise AttributeError(attr + " not found")

How do I correct this error? And is there any other way to improve the accuracy further?

Here is a link to the full code for review: Full Code

L3viathan
Madhi

1 Answer

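In your code, doc_clf is a pipeline, so count_vect.transform() and tf_idf.transform() are applied automatically when you call predict(). Because you transformed the document yourself first, the pipeline's CountVectorizer received a sparse matrix instead of raw text and failed when it tried to call lower() on it, which is the AttributeError at the bottom of your traceback.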

You should only call

category = doc_clf.predict([r'document'])

As this document passes through the pipeline, it will be automatically transformed by the CountVectorizer and TfidfTransformer.
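To make this concrete, here is a minimal sketch of such a pipeline. The step names, the MultinomialNB classifier, and the toy training data are assumptions for illustration, not taken from your code. It shows that calling predict() on raw text gives the same result as running the fitted steps by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical, for illustration only).
train_docs = [
    "the striker scored a goal in the match",
    "the team won the football match",
    "the new phone has a fast processor",
    "the laptop ships with a faster processor and more memory",
]
train_labels = ["sports", "sports", "tech", "tech"]

# The pipeline chains vectorizing, tf-idf weighting and classification.
# The classifier choice here is an assumption; any estimator works.
doc_clf = Pipeline([
    ("count_vect", CountVectorizer()),
    ("tf_idf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
doc_clf.fit(train_docs, train_labels)

new_doc = ["the goalkeeper saved the match"]

# Correct: pass raw text; the pipeline transforms it internally.
category = doc_clf.predict(new_doc)

# Equivalent manual version: apply each fitted step yourself and call
# predict() on the final classifier only, not on the whole pipeline.
count_vect = doc_clf.named_steps["count_vect"]
tf_idf = doc_clf.named_steps["tf_idf"]
clf = doc_clf.named_steps["clf"]
manual = clf.predict(tf_idf.transform(count_vect.transform(new_doc)))

print(category, manual)
```

Note that predict() expects a list (or other iterable) of documents, so a single document still has to be wrapped in a list.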

Vivek Kumar