3

This code trains a word2vec model, which I then want to use to train a naive Bayes classifier. I can generate the word2vec model and use its similarity functions successfully. As a next step, I want to use the word2vec output to train the naive Bayes classifier. Currently the code gives an error when I try to split the data into test and training sets. How do I convert the word2vec model into an array so that it can be used as training data?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import gensim

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
#    for word2vec we want an array of vectors

    corpus.append(review)

#print(corpus)
X = gensim.models.Word2Vec(corpus, min_count=1,size=1000)
#print (X.most_similar("love"))


#embedding_matrix = np.zeros(len(X.wv.vocab), dtype='float32')
#for i in range(len(X.wv.vocab)):
#    embedding_vector = X.wv[X.wv.index2word[i]]
#    if embedding_vector is not None:
#        embedding_matrix[i] = embedding_vector

# Creating the Bag of Words model
#from sklearn.feature_extraction.text import CountVectorizer
#cv = CountVectorizer(max_features = 1500)
#X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

It gives an error on this line:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
TypeError: Expected sequence or array-like, got <class 'gensim.models.word2vec.Word2Vec'>
Philip
Raj
  • You need to convert your corpus to vectors using your embeddings: https://stackoverflow.com/questions/29760935/how-to-get-vector-for-a-sentence-from-the-word2vec-of-tokens-in-sentence – Philip Dec 16 '17 at 12:31

2 Answers

4

Word2Vec provides word embeddings only. To characterize documents by embeddings, you need to aggregate (average, sum, or take the max of) the embeddings of all words in each document to obtain a single D-dimensional vector that can be used for classification. See here and there for further information on this.
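A minimal sketch of the averaging approach, assuming the embeddings are available as a mapping from token to NumPy vector. With gensim, `X.wv` from the question supports the same `word in ...` and `...[word]` lookups. The helper name `document_vector` and the toy embeddings are hypothetical and only serve as an illustration.

```python
import numpy as np

def document_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have a vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token (or empty document): fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy stand-in for the trained word vectors (assumption, not real data).
embeddings = {
    "love": np.array([1.0, 0.0, 2.0]),
    "place": np.array([3.0, 2.0, 0.0]),
}
corpus = [["love", "place"], ["love", "unknown"], []]

# One row per document, one column per embedding dimension.
features = np.vstack([document_vector(doc, embeddings, dim=3) for doc in corpus])
```

Unlike the Word2Vec model itself, `features` is a plain 2-D array, so it can be passed to `train_test_split` and `GaussianNB.fit`.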

Alternatively, you can use a Doc2Vec model to produce document embeddings directly; gensim also provides a very good implementation of it.

Elliot
1

You have a vector for each word, so there are two ways to move forward. One is simply to take the average of all the word vectors in a sentence to get the sentence vector. The other is to use TF-IDF.
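A rough sketch of one way to read the TF-IDF option: weight each word vector by its term count times a smoothed inverse document frequency before averaging. This interpretation, the helper names `idf_weights` and `tfidf_document_vector`, and the toy embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(corpus):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def tfidf_document_vector(tokens, embeddings, idf, dim):
    """Weighted average of word vectors, with weight = term count * idf."""
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    if not counts:
        return np.zeros(dim)
    weights = np.array([counts[t] * idf[t] for t in counts])
    vectors = np.array([embeddings[t] for t in counts])
    return weights @ vectors / weights.sum()

# Toy data (assumption): "food" occurs in every document, so it gets the lowest idf.
corpus = [["good", "food"], ["bad", "food"]]
embeddings = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "food": np.array([0.0, 1.0]),
}
idf = idf_weights(corpus)
features = np.vstack([tfidf_document_vector(doc, embeddings, idf, dim=2) for doc in corpus])
```

Compared with a plain average, words that appear in most documents contribute less to the sentence vector.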

I implemented the averaging approach in one of my ongoing projects and am sharing the GitHub link; go to the heading "text vectorization(word2vec)" and you will find the code there: https://github.com/abhibhargav29/SentimentAnalysis/blob/master/SentimentAnalysis.ipynb. I would, however, suggest reading the data cleaning section first to understand it better.

One important piece of advice: do not split the data into train, cv, and test sets after vectorization. Do it before vectorization, or you will overfit the model.
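The ordering can be sketched as follows: split the tokenized documents first, then fit anything that learns from text on the training part only. The helper `split_before_vectorizing` and the toy documents are hypothetical; in practice scikit-learn's `train_test_split` applied to the raw documents would play the same role.

```python
import numpy as np

def split_before_vectorizing(docs, labels, test_size=0.2, seed=0):
    """Shuffle and split tokenized documents and labels before fitting any embedding."""
    order = np.random.default_rng(seed).permutation(len(docs))
    n_test = int(round(len(docs) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    train_docs = [docs[i] for i in train_idx]
    test_docs = [docs[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return train_docs, test_docs, y_train, y_test

# Toy tokenized reviews and labels (assumption, not the real dataset).
docs = [["good", "food"], ["bad", "servic"], ["love", "place"],
        ["slow", "wait"], ["great", "staff"]]
labels = [1, 0, 1, 0, 1]
train_docs, test_docs, y_train, y_test = split_before_vectorizing(docs, labels)

# Fit the vocabulary, Word2Vec model, or IDF weights on train_docs only.
train_vocab = {token for doc in train_docs for token in doc}

# Test documents are vectorized with what was learned from the training set;
# tokens never seen during training are unknown at test time.
unseen = {token for doc in test_docs for token in doc} - train_vocab
```

Tokens in `unseen` have no vector at test time and need a fallback, for example skipping them when averaging.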