
Using sklearn I've created a bag-of-words (BOW) with 200 features in Python, which are easily extracted. But how can I reverse it? That is, how do I go from a vector of 200 0's or 1's to the corresponding words? Since the vocabulary is a dictionary, and thus not ordered, I am not sure which word each element in the feature vector corresponds to. Also, if the first element in my 200-dimensional vector corresponds to the first word in the dictionary, how do I then extract a word from the dictionary by index?

The BOW is created this way:

vec = CountVectorizer(stop_words = sw, strip_accents="unicode", analyzer = "word", max_features = 200)
features = vec.fit_transform(data.loc[:,"description"]).todense()

Thus "features" is an (n, 200) matrix, n being the number of sentences.


1 Answer


I'm not totally sure what you're going for, but it seems like you're just trying to figure out which column represents which word. For this, there is the handy get_feature_names method (renamed get_feature_names_out in scikit-learn 1.0; the old name was removed in 1.2).

Let's take a look with the example corpus provided in the docs:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?' ]

# Put into a dataframe
data = pd.DataFrame(corpus,columns=['description'])
# Take a look:
>>> data
                             description
0            This is the first document.
1  This document is the second document.
2             And this is the third one.
3            Is this the first document?

# Initialize CountVectorizer (you can put in your arguments, but for the sake of example, I'm keeping it simple):
vec = CountVectorizer()

# Fit it as you had before:
features = vec.fit_transform(data.loc[:,"description"]).todense()

>>> features
matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

To see which column represents which word, use get_feature_names:

>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

So your first column is and, second is document, and so on. For readability, you can stick this in a dataframe:

>>> pd.DataFrame(features, columns = vec.get_feature_names())
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
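
If you want to go all the way back from a row of counts (or 0's and 1's) to its words, here is a minimal sketch using the same example corpus. It assumes a fitted CountVectorizer and relies on inverse_transform and the vocabulary_ attribute (a dict mapping each word to its column index); the names index_to_word, row and words_in_row are just illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
features = vec.fit_transform(corpus)  # sparse matrix, one row per document

# Option 1: let the vectorizer map the nonzero columns of each row back to words.
# Returns one array of words per row; the order within a row is not guaranteed.
words_per_doc = vec.inverse_transform(features)

# Option 2: invert vocabulary_ (word -> column index) so you can look up by index.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# Take a single row as a plain 1-D vector and list the words it contains.
row = features.toarray()[2]
words_in_row = [index_to_word[i] for i in np.flatnonzero(row)]

print(index_to_word[0])  # word for the first column
print(words_in_row)      # words present in the third document
```

Note that both approaches only tell you which words occur (and, from the counts, how often); the original word order of the sentence is not recoverable from a bag-of-words vector.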
    That is what I'm looking for! I've searched the documentation and a lot of examples and I have no idea how I missed that. Thank you! :D – CutePoison Oct 11 '18 at 12:23