Problem with CountVectorizer from scikit-learn package

Question

I have a dataset of movie reviews with two columns: 'class' and 'reviews'. I have done most of the routine preprocessing: lowercasing, removing stop words, and removing punctuation marks. After preprocessing, each review is a string of words separated by spaces.

I want to use CountVectorizer and then TF-IDF to create features from my dataset so I can do text classification with Random Forest. I looked at several websites and tried to follow what they did. This is my code:

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer()
new_X = vectorizer.fit_transform(X)
tfidfconverter = TfidfTransformer()  
X1 = tfidfconverter.fit_transform(new_X)
print(X1)

But I get this output:

(0, 0)  1.0

which doesn't make sense at all. I tinkered with some parameters and commented out the TF-IDF parts. Here's my code:

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer(analyzer = 'char_wb',  \
                         tokenizer = None, \
                         preprocessor = None, \
                         stop_words = None, \
                         max_features = 5000)

new_X = vectorizer.fit_transform(X)
print(new_X)

and this is my output:

(0, 4)  1
(0, 6)  1
(0, 2)  1
(0, 5)  1
(0, 1)  2
(0, 3)  1
(0, 0)  2

Am I missing something? Or am I too much of a beginner to understand? My understanding was that after the transform I would get a new dataset with many features (the words and their frequencies) plus the label column. But what I am getting is far from that.

To repeat: all I want is a new dataset derived from my reviews, with words as features and numbers as values, so that Random Forest or another classification algorithm can work with it.

Thanks.

By the way, these are the first five rows of my dataset:

   class                                            reviews
0      1                         da vinci code book awesome
1      1  first clive cussler ever read even books like ...
2      1                            liked da vinci code lot
3      1                            liked da vinci code lot
4      1            liked da vinci code ultimatly seem hold

BTW, you can use `TfidfVectorizer`, which combines `CountVectorizer` and `TfidfTransformer`. — iz_, Jan 14 '19 at 06:41
What you call "strange output that does not make sense" is a sparse matrix. You can proceed from here either by turning it to dense, or using it `as it is` as an input to RandomForest classifier. — Sergey Bushmanov, Jan 14 '19 at 06:45
@SergeyBushmanov But then I see this error: "Found input variables with inconsistent numbers of samples: [1, 7086]"... — , Jan 14 '19 at 06:55
[1, 7086] is definitely not what you want to see for this kind of problem/dataset. See my answer below for a step-by-step workflow. — Sergey Bushmanov, Jan 14 '19 at 09:39

Sergey Bushmanov · Accepted Answer · 2019-01-14T08:46:16.273

Suppose you happen to have a dataframe:

data
    class   reviews
0   1   da vinci code book aw...
1   1   first clive cussler ever read even books lik...
2   1   liked da vinci cod...
3   1   liked da vinci cod...
4   1   liked da vinci code ultimatly seem...

Separate into features and outcomes:

y = data['class']
X = data.drop('class', axis = 1)
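
Note that X here is still a one-column DataFrame, not a column of strings. The following is a minimal sketch of why that matters, using a made-up three-row frame (`toy` and its contents are assumptions for illustration, not the original dataset). Iterating over a DataFrame yields its column labels, so passing the whole frame to `CountVectorizer` gives it a single "document", the string 'reviews'. That would be consistent with both the one-row output and the [1, 7086] sample-count error reported in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    "class": [1, 1, 0],
    "reviews": [
        "da vinci code book awesome",
        "liked da vinci code lot",
        "hated da vinci code",
    ],
})
X_toy = toy.drop("class", axis=1)

# Iterating a DataFrame yields column labels, not rows
print(list(X_toy))

# Whole DataFrame: the vectorizer sees one "document" (the column name)
wrong = CountVectorizer().fit_transform(X_toy)
print(wrong.shape)

# Single column (a Series of strings): one row per review
right = CountVectorizer().fit_transform(X_toy["reviews"])
print(right.shape)
```

The first shape has one row regardless of how many reviews the frame holds, while the second has one row per review.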

Then, following your pipeline, you can prepare your data for any ML algo like this:

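from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
vectorizer = CountVectorizer()
# Pass the 'reviews' column (a Series of strings), not the whole DataFrame
new_X = vectorizer.fit_transform(X.reviews)
new_X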
<5x18 sparse matrix of type '<class 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

This new_X can be used in your further pipeline "as is" or converted to a dense matrix:

new_X.todense()
matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]],
       dtype=int64)

Rows in this matrix correspond to rows in the original reviews column, and each column holds the count of one word. To see which column refers to which word, inspect the vocabulary:

vectorizer.vocabulary_
{'da': 6,
 'vinci': 17,
 'code': 4,
 'book': 1,
 'awesome': 0,
 'first': 9,
 'clive': 3,
 'cussler': 5,
....

where each key is a word and each value is its column index in the matrix above (you may infer that the column indices follow the alphabetically ordered vocabulary, with 'awesome' in column 0, and so on).
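
A small sketch of that mapping, assuming two made-up review strings (`docs` is hypothetical, not the original data): sorting the vocabulary by column index recovers the words in column order, and the vocabulary size equals the number of feature columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample reviews for illustration
docs = ["da vinci code book awesome", "liked da vinci code lot"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Words listed in column order: sort the keys by their column index
words_by_column = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(words_by_column)

# Check that column order matches the sorted order of the words
print(words_by_column == sorted(vectorizer.vocabulary_))

# Number of features: vocabulary size and matrix width agree
print(len(vectorizer.vocabulary_), counts.shape[1])
```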

You may further proceed with your pipeline like this:

tfidfconverter = TfidfTransformer()  
X1 = tfidfconverter.fit_transform(new_X)
X1
<5x18 sparse matrix of type '<class 'numpy.float64'>'
    with 30 stored elements in Compressed Sparse Row format>

Finally, you can feed your preprocessed data into RandomForest:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X1, y)

This code runs without error on my notebook. Please let us know if this solves your problem!
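
As a comment on the question points out, `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer`. Below is a minimal sketch of the same workflow in that form, assuming a tiny made-up dataset (`toy`, the pipeline name `model`, and `random_state=0` are illustrative choices, not from the original post). It also checks that, with default settings, the one-step and two-step routes produce the same matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    "class": [1, 1, 0, 0],
    "reviews": [
        "da vinci code book awesome",
        "liked da vinci code lot",
        "hated da vinci code",
        "da vinci code book boring",
    ],
})

# Two-step route from the answer vs. one-step TfidfVectorizer (defaults)
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(toy["reviews"])
)
one_step = TfidfVectorizer().fit_transform(toy["reviews"])
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)

# Vectorizer and classifier chained; the sparse matrix is passed on as is
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(toy["reviews"], toy["class"])
preds = model.predict(toy["reviews"])
print(preds.shape)
```

Wrapping both steps in one pipeline means the vocabulary learned during `fit` is reused automatically when predicting on new reviews.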

I have some follow-up questions for learning purposes. On what basis are the words assigned to columns? That is, why is 'awesome' the 0th column? And how can I count how many words became columns/features? — , Jan 14 '19 at 15:15
How words are assigned to columns is decided by `CountVectorizer`. The logic, as I stated, is as follows: (i) all unique words are extracted and constitute the `vocabulary`; (ii) the words are ordered alphabetically; (iii) the first word is assigned to the first column, the second to the second, and so on. The end result can be accessed via `vectorizer.vocabulary_` (see answer). The number of words is simply `len(vectorizer.vocabulary_)`. — Sergey Bushmanov, Jan 14 '19 at 15:21
