I have a dataset of movie reviews with two columns: 'class' and 'reviews'. I have done most of the routine preprocessing: lowercasing, removing stop words, and removing punctuation. After preprocessing, each review is a string of words separated by single spaces.
I want to use CountVectorizer and then TF-IDF to create features from my dataset so I can do text classification with Random Forest. I followed a few tutorials online and tried to do what they did. This is my code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer()
new_X = vectorizer.fit_transform(X)
tfidfconverter = TfidfTransformer()
X1 = tfidfconverter.fit_transform(new_X)
print(X1)
But I get this output:
(0, 0) 1.0
This doesn't make sense to me at all. I tinkered with some parameters and commented out the TF-IDF part. Here's my code:
data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer(analyzer='char_wb',
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
new_X = vectorizer.fit_transform(X)
print(new_X)
And this is my output:
(0, 4) 1
(0, 6) 1
(0, 2) 1
(0, 5) 1
(0, 1) 2
(0, 3) 1
(0, 0) 2
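A quick check that may help narrow this down: the output above has only one row (every entry is in row 0), which suggests the vectorizer saw a single document. The sketch below assumes a small hand-built DataFrame in place of the CSV (the three rows are copied from the sample in this post, and the name `seen_by_vectorizer` is made up for illustration). It prints what iterating over `X` actually yields and which vocabulary CountVectorizer learns from it.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the CSV, built from rows shown in this post.
data = pd.DataFrame({
    "class": [1, 1, 1],
    "reviews": [
        "da vinci code book awesome",
        "liked da vinci code lot",
        "liked da vinci code ultimatly seem hold",
    ],
})

X = data.drop("class", axis=1)  # still a DataFrame, with one column

# CountVectorizer simply iterates over its input.
# Iterating over a DataFrame yields column labels, not row contents.
seen_by_vectorizer = list(X)
print(seen_by_vectorizer)

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(X)
print(counts.shape)            # (documents, distinct tokens)
print(vectorizer.vocabulary_)  # token -> column index
```

If the only "document" turns out to be the column label, that would be consistent with both outputs above: a single word in the first case, and a handful of single characters in the 'char_wb' case.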
Am I missing something, or am I just too much of a beginner to understand? My understanding was that after the transform I would get a new dataset with many features (the words and their frequencies) plus the label column. What I am getting is far from that.
To repeat: all I want is a new dataset built from my reviews, with words as features and numbers as values, so that Random Forest or another classification algorithm can work with it.
Thanks.
By the way, these are the first five rows of my dataset:
   class  reviews
0      1  da vinci code book awesome
1      1  first clive cussler ever read even books like ...
2      1  liked da vinci code lot
3      1  liked da vinci code lot
4      1  liked da vinci code ultimatly seem hold
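For concreteness, here is a minimal sketch of the kind of result described above, under one assumption: the vectorizer is given the 'reviews' column itself (a Series of strings, one per row) rather than a one-column DataFrame. The five rows are typed in by hand from the printout, with the truncated row left exactly as shown, and the names `words` and `features` are only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# The five rows printed above, entered by hand (row 1 is truncated in the printout).
data = pd.DataFrame({
    "class": [1, 1, 1, 1, 1],
    "reviews": [
        "da vinci code book awesome",
        "first clive cussler ever read even books like ...",
        "liked da vinci code lot",
        "liked da vinci code lot",
        "liked da vinci code ultimatly seem hold",
    ],
})

X = data["reviews"]  # a Series of strings: one document per row
y = data["class"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(X)              # sparse, shape (n_reviews, n_words)
tfidf = TfidfTransformer().fit_transform(counts)  # same shape, TF-IDF weighted

# Dense view for inspection only; columns follow the vocabulary's column indices.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
features = pd.DataFrame(tfidf.toarray(), columns=words)
features["class"] = y.values
print(features.shape)
print(features.head())
```

The dense DataFrame is only there to look at the result. As far as I understand, scikit-learn classifiers such as RandomForestClassifier accept the sparse `tfidf` matrix directly, with `y` passed separately as the labels, so converting to a dense table should not be needed for training.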