Python Scikit-learn CountVectorizer throwing ValueError: empty vocabulary

Question

I'm trying to extract features from a text document. Here is my code:

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

files = load_files('/home/niyas/Documents/project/container', shuffle=False)
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(files.data[1])
Y = vectorizer.get_feature_names()

I get the error "ValueError: empty vocabulary; perhaps the documents only contain stop words". The same code works fine when I pass a string with exactly the same content as the text document.

Help me. Thanks in advance.

Shouldn't it just be ``files.data``? Can you post the content of ``files.data[0]``? — Andreas Mueller, Mar 05 '15 at 06:02
Actually I can print `files.data[1]` and the above code works fine when I pass a string with the exact same content of the text doc. — Niyas, Mar 05 '15 at 18:25
I know you can print ``files.data[1]``; I was asking what it contains. Which text doc? The one that ``load_files`` returns as ``files.data[1]``? — Andreas Mueller, Mar 05 '15 at 19:20
Here is the complete text. "Acer S1213Hne 3D Ready DLP Projector - 720p - HDTV - 4:3 - 2.6 - NTSC, PAL, SECAM - 1024 x 768 - XGA - 17,000:1 - 3000 lm - HDMI - USB - VGA In - Ethernet - 290 W - White Color - 1 Year WarrantyThe professional Acer P7 Series employs advanced technologies and convenient setup utilities to deliver clear, persuasive presentations in large business venues. Enterprising innovators like you will appreciate the high resolution, brightness and contrast ratio to drive your ventures toward greater success." — Niyas, Mar 10 '15 at 02:41
Yeah, so it is only a single string. ``fit_transform`` needs a list of strings, so either wrap it as ``vectorizer.fit_transform([files.data[1]])`` or use all files: ``vectorizer.fit_transform(files.data)``. — Andreas Mueller, Mar 10 '15 at 03:06
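
To illustrate why a bare string produces an empty vocabulary, here is a rough standard-library sketch. It is not scikit-learn's implementation: `build_vocabulary` is a hypothetical helper, and the only thing borrowed from the library is the default `token_pattern` of `CountVectorizer`, which keeps tokens of two or more word characters. Iterating over a string yields single characters, so each "document" is one character long and nothing matches.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(raw_documents):
    """Hypothetical helper: collect tokens from an iterable of documents."""
    vocabulary = set()
    for doc in raw_documents:
        # Lowercase each document, then keep every token the pattern matches.
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(vocabulary)


text = "Acer S1213Hne 3D Ready DLP Projector"  # shortened sample text

# A bare string is iterated character by character: "A", "c", "e", ...
as_string = build_vocabulary(text)

# A one-element list is iterated document by document.
as_list = build_vocabulary([text])

print(as_string)  # []
print(as_list)    # ['3d', 'acer', 'dlp', 'projector', 'ready', 's1213hne']
```

The empty result in the first case corresponds to the "empty vocabulary" error. As far as I know, more recent scikit-learn releases reject a bare string up front with a different ValueError, so the exact message may depend on the installed version.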
