I'm mining the Stack Overflow data dump of posts about deep learning libraries. I'd like to identify stop words in my corpus (like 'python' for instance). I want to get my feature names so I can identify the words with highest term frequencies.

I create my documents and my corpus as follows:

with open("StackOverflow_2018_Data.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    pytorch_doc = ''
    tensorflow_doc = ''
    cotag_list = []
    keras_doc = ''
    counte = 0
    for row in csv_reader:
        if row[2] == 'tensorflow':
            tensorflow_doc += row[3] + ' '
        if row[2] == 'keras':
            keras_doc += row[3] + ' '
        if row[2] == 'pytorch':
            pytorch_doc += row[3] + ' '

corpus = [pytorch_doc, tensorflow_doc, keras_doc]
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)
print(x)
x.toarray()
Dict = []
feat = x.get_feature_names()
for i,arr in enumerate(x):
    for x, ele in enumerate(arr):
        if i == 0:
            Dict += ('pytorch', feat[x], ele)
        if i == 1:
            Dict += ('tensorflow', feat[x], ele)
        if i == 2:
            Dict += ('keras', feat[x], ele)

sorted_arr = sorted(Dict, key=lambda tup: tup[2])

However, I am getting:

  File "sklearn_stopwords.py", line 83, in <module>
    main()
  File "sklearn_stopwords.py", line 50, in main
    feat = x.get_feature_names()
  File "/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py", line 686, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found
Scott Boston
maddie
  • I think you need `vectorizer.get_feature_names()`. The result of fit_transform is a scipy sparse matrix, which has no such method; the CountVectorizer object does. – Scott Boston Apr 04 '19 at 19:52

3 Answers


get_feature_names is a method of the CountVectorizer object. You are calling it on the result of fit_transform, which is a scipy.sparse matrix and has no such method.

You need to use vectorizer.get_feature_names().

Try this MCVE:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

X = vectorizer.fit_transform(corpus)

features = vectorizer.get_feature_names()

features

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Scott Boston

I ran into a similar issue with my code. According to the scikit-learn release notes, the get_feature_names method of CountVectorizer is no longer available in version 1.2 and above, although it still works in version 1.1 and earlier. If you have upgraded scikit-learn to 1.2 or later, use the get_feature_names_out method instead. You can find more details at the following link:

https://scikit-learn.org/1.1/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvector

I hope this information is helpful. Best regards.

Evans sang

You may have to use get_feature_names_out, because get_feature_names was deprecated in the v1.0 branch and was not fully removed until the v1.2 branch.
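
If the same script has to run on both sides of that change, one option is a small wrapper that prefers the newer method and falls back to the older one. This is only a sketch: the helper name feature_names is made up for illustration and is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer


def feature_names(vectorizer):
    """Return the fitted vocabulary as a list, whichever accessor exists.

    Hypothetical helper: tries get_feature_names_out first (scikit-learn 1.0+)
    and falls back to get_feature_names on older releases.
    """
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        getter = vectorizer.get_feature_names
    return list(getter())


vectorizer = CountVectorizer()
vectorizer.fit(["This is the first document.", "And the third one."])
names = feature_names(vectorizer)
print(names)
```

Pinning the scikit-learn version in the project's requirements avoids the need for the fallback altogether.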

Zenul_Abidin