
Currently I am working on a project and using TF-IDF to transform the X_train data, which contains text. When I call count_vectorizer.fit_transform(X_train) I get this error:

Traceback (most recent call last):
  File "train.py", line 100, in <module>
    counts = count_vectorizer.fit_transform(X_train)
  File "/home/vishalthadari/Documents/Seperation 1/API's/Confirmation API/python 3 /env/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/home/vishalthadari/Documents/Seperation 1/API's/Confirmation API/python 3 /env/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 811, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
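
The same error can be reproduced with a minimal sketch (hypothetical documents, not the project's files): CountVectorizer raises it whenever no document yields a single token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Documents that are empty, whitespace-only, or contain no word
# characters produce no tokens, hence an empty vocabulary
docs = ["", "   ", "!!!"]
try:
    CountVectorizer().fit_transform(docs)
except ValueError as e:
    print(e)  # empty vocabulary; perhaps the documents only contain stop words
```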

I read other Stack Overflow questions like this Link, but I cannot understand how to split the X_train data.

Here's my train.py file:

import os
import numpy
from pandas import DataFrame
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

NEWLINE = '\n'

TRAVEL = 'Travel'
OTHER = 'Other'

SOURCES = [
    ('data/travel',    TRAVEL),
    ('data/other',    OTHER),
]

SKIP_FILES = {'cmds', '.DS_Store'}

SEED = 0 # for reproducibility

def read_files(path):
    # Reads all files in all directories under path; os.walk already
    # descends into subdirectories, so no manual recursion is needed
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                past_header = True
                    content = NEWLINE.join(lines)
                    yield file_path, content

def build_data_frame(path, classification):
    # Returns a data frame of all the files read using read_files()
    data_frame = DataFrame({'text': [], 'class': []})
    for file_name, text in read_files(path):
        data_frame = data_frame.append(
            DataFrame({'text': [text], 'class': [classification]}, index=[file_name]))
    return data_frame

data = DataFrame({'text': [], 'class': []})
for path, classification in SOURCES:
    data = data.append(build_data_frame(path, classification))

data = data.reindex(numpy.random.permutation(data.index))

#Training data
X_train = numpy.asarray(data['text'])
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(X_train)
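
Before the fit_transform call, a quick debugging sketch (using a hypothetical stand-in for the X_train built above) can show whether the parsed files are coming back empty, which is what triggers this error:

```python
import numpy

# Hypothetical stand-in for the X_train array built above
X_train = numpy.asarray(["some travel text", "", "   "])

# Count documents that are empty or whitespace-only; if none of the
# documents yields a token, CountVectorizer raises the ValueError
empty = sum(1 for doc in X_train if not str(doc).strip())
print(f"{len(X_train)} documents, {empty} empty or whitespace-only")
```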

I followed all the suggested solutions but still haven't solved the issue. Am I taking the wrong approach to transforming the data? If my approach is right, why am I getting this error?

Thanks in advance.

desertnaut
  • Welcome to SO; please see [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve), as well as why [a wall of code isn't helpful](http://idownvotedbecau.se/toomuchcode/). – desertnaut Nov 06 '18 at 16:28
  • Thanks a lot; as I am new to Stack Overflow it will take time to settle in with the community. I really appreciate your suggestion. – user10538706 Nov 06 '18 at 16:31
  • You are welcome; SO does not work by simply throwing in the whole of our code, and here, since your problem is in the vectorizing, arguably everything related to the ML models themselves (among possibly other things) is irrelevant to the issue and should be removed (it only adds clutter)... – desertnaut Nov 06 '18 at 16:33
  • I have edited your code to remove irrelevant things just to give you an idea (but still, it is neither minimal nor reproducible)... – desertnaut Nov 06 '18 at 17:26
  • Thanks a lot, but I cannot understand; can you please elaborate on the solution? – user10538706 Nov 07 '18 at 05:46
  • No solution - just tidied up your code to remove stuff irrelevant to the issue... – desertnaut Nov 07 '18 at 09:12

1 Answer


One scenario where this can happen is when your data (a pandas column, list, or string) contains no useful text: the values might be empty, or might contain only special characters.

In such cases, replace the empty values with a placeholder text, something like this:

df['column_1'] = 'Some text'

or replace NaN values using:

df['column_1'] = df['column_1'].fillna('Some text')
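
Alternatively, instead of filling placeholders, a sketch along these lines (assuming a pandas DataFrame df with a text column named column_1, as above) drops the rows with no usable text before vectorizing:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data: one usable document, plus empty/NaN/whitespace rows
df = pd.DataFrame({'column_1': ['good travel text', '', None, '   ']})

# Replace NaN with '', then keep only rows with non-whitespace content
df['column_1'] = df['column_1'].fillna('')
df = df[df['column_1'].str.strip().astype(bool)]

counts = CountVectorizer().fit_transform(df['column_1'])
print(counts.shape)  # (1, 3): one surviving document, three vocabulary terms
```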
WisdomSeeker