get_feature_names not found in countvectorizer()

Question

I'm mining the Stack Overflow data dump of posts about deep learning libraries. I'd like to identify stop words in my corpus (such as 'python'). I want to get my feature names so I can identify the words with the highest term frequencies.

I create my documents and my corpus as follows:

with open("StackOverflow_2018_Data.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    pytorch_doc = ''
    tensorflow_doc = ''
    cotag_list = []
    keras_doc = ''
    counte = 0
    for row in csv_reader:
        if row[2] == 'tensorflow':
            tensorflow_doc += row[3] + ' '
        if row[2] == 'keras':
            keras_doc += row[3] + ' '
        if row[2] == 'pytorch':
            pytorch_doc += row[3] + ' '

corpus = [pytorch_doc, tensorflow_doc, keras_doc]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)
print(x)
x.toarray()
Dict = []
feat = x.get_feature_names()
for i,arr in enumerate(x):
    for x, ele in enumerate(arr):
        if i == 0:
            Dict += ('pytorch', feat[x], ele)
        if i == 1:
            Dict += ('tensorflow', feat[x], ele)
        if i == 2:
            Dict += ('keras', feat[x], ele)

sorted_arr = sorted(Dict, key=lambda tup: tup[2])

However, I am getting:

  File "sklearn_stopwords.py", line 83, in <module>
    main()
  File "sklearn_stopwords.py", line 50, in main
    feat = x.get_feature_names()
  File "/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found

I think you need `vectorizer.get_feature_names()`. The result of fit_transform is a SciPy sparse matrix, which has no such method; get_feature_names belongs to the CountVectorizer object. - Scott Boston, Apr 04 '19 at 19:52

score 7 · Accepted Answer · answered Apr 04 '19 at 19:59

get_feature_names is a method of the CountVectorizer object. You are trying to call get_feature_names on the result of fit_transform, which is a scipy.sparse matrix.

You need to use vectorizer.get_feature_names().
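To make the distinction concrete, here is a minimal sketch that inspects both objects. The two sample strings are made up for illustration, and the checks use `hasattr` so they do not depend on which scikit-learn version is installed.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["tensor shapes in keras", "pytorch tensor ops"]  # made-up sample text

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# fit_transform returns the document-term counts as a SciPy sparse matrix.
print(type(X).__name__, issparse(X), X.shape)

# The matrix only holds numbers; it has no vocabulary-related methods.
matrix_has_names = (
    hasattr(X, "get_feature_names") or hasattr(X, "get_feature_names_out")
)

# The fitted vectorizer does (which of the two depends on the version).
vectorizer_has_names = (
    hasattr(vectorizer, "get_feature_names")
    or hasattr(vectorizer, "get_feature_names_out")
)
print(matrix_has_names, vectorizer_has_names)
```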

Try this MVCE:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

X = vectorizer.fit_transform(corpus)

features = vectorizer.get_feature_names()

features

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
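Applied to the question's loop, the same fix might look like the sketch below. The three short strings are hypothetical stand-ins for the concatenated per-library documents, and the `hasattr` check is only there so the snippet also runs on scikit-learn releases that no longer ship `get_feature_names`. The "appears in every document" filter at the end is one possible heuristic for spotting corpus-specific stop words, not something taken from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the three concatenated per-library documents.
labels = ["pytorch", "tensorflow", "keras"]
corpus = [
    "python tensor python autograd",
    "python graph session graph graph",
    "python layers model layers",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# Ask the vectorizer (not the matrix) for the vocabulary.
if hasattr(vectorizer, "get_feature_names_out"):
    feat = vectorizer.get_feature_names_out().tolist()
else:
    feat = vectorizer.get_feature_names()

# Dense is fine for three documents; keep it sparse for a large corpus.
dense = counts.toarray()

# Collect (library, term, count) tuples; append keeps each tuple intact.
rows = []
for label, row in zip(labels, dense):
    for term, count in zip(feat, row):
        if count > 0:
            rows.append((label, term, int(count)))

# Highest counts first.
sorted_rows = sorted(rows, key=lambda tup: tup[2], reverse=True)
print(sorted_rows[:3])

# Terms present in every document are candidate corpus-specific stop words.
in_every_doc = [term for term, col in zip(feat, dense.T) if (col > 0).all()]
print(in_every_doc)
```

Note that the loop variables no longer reuse the name of the count matrix, and tuples are added with `append` rather than `+=`, which would have flattened each tuple into three separate list items.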

score 1 · Answer 2 · answered Jun 29 '23 at 12:54

I encountered a similar issue with my code. After reviewing the release notes for CountVectorizer, I learned that the get_feature_names method was removed in scikit-learn 1.2. It still works in version 1.1 and earlier. If you have upgraded scikit-learn to 1.2 or later, I suggest using the get_feature_names_out method instead. You can find more details at the following link:

https://scikit-learn.org/1.1/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvector

I hope this information is helpful. Best regards.
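If the same script has to run against several scikit-learn versions, one option is a small wrapper like the sketch below. The helper name `feature_names` is hypothetical; it prefers `get_feature_names_out` when the installed version provides it and otherwise falls back to the older method. It also converts the result to a plain list, since `get_feature_names_out` returns a NumPy array rather than a list.

```python
import sklearn
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return the fitted vocabulary as a plain list (hypothetical helper)."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn: the method returns a NumPy array of strings.
        return vectorizer.get_feature_names_out().tolist()
    # Older scikit-learn: only the since-removed method exists.
    return vectorizer.get_feature_names()


vectorizer = CountVectorizer()
vectorizer.fit(["This is the first document.", "And the third one."])

names = feature_names(vectorizer)
print(sklearn.__version__, names)
```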

score 0 · Answer 3 · answered Jun 18 '23 at 10:16

You may have to use get_feature_names_out as the method get_feature_names was deprecated in the v1.0 branch but it wasn't fully removed until the v1.2 branch.

Zenul_Abidin
