
I am trying to perform text classification using machine learning. For that, I have extracted feature vectors from the pre-processed textual data using a simple bag-of-words approach (CountVectorizer) and a TfidfVectorizer.

Now I want to use word2vec (i.e. word embeddings) for my feature vectors, similar to CountVectorizer/TfidfVectorizer: I should be able to learn a vocabulary from the training data and then transform the test data with that learned vocabulary, but I can't find a way to implement this.

# I need something like this with word2vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

count = CountVectorizer()
train_feature_vector = count.fit_transform(train_data)
test_feature_vector = count.transform(test_data)

# So I can train my model like this
mb = MultinomialNB()
mb.fit(train_feature_vector, y_train)
acc_score = mb.score(test_feature_vector, y_test)
print("Accuracy " + str(acc_score))

1 Answer


You should first understand what word embeddings are. When you apply a CountVectorizer or TfidfVectorizer, what you get is a sparse sentence representation over the vocabulary, essentially a (weighted) bag-of-words or one-hot style encoding. A word embedding, by contrast, represents each individual word as a dense, real-valued vector (typically a few hundred dimensions).
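For instance, here is a minimal sketch using gensim 4.x (an assumption; the question doesn't name a library) that fits a word2vec model on the training data only, analogous to calling fit on a CountVectorizer; train_data is assumed to be a list of raw strings as in the question:

# Minimal sketch, assuming gensim 4.x and that train_data is a list of raw strings.
from gensim.models import Word2Vec

tokenized_train = [doc.split() for doc in train_data]  # simple whitespace tokenization

# Learn an embedding per word from the training data only.
w2v = Word2Vec(sentences=tokenized_train, vector_size=100, window=5,
               min_count=2, workers=4)

# Each in-vocabulary word now maps to a dense 100-dimensional vector.
vector = w2v.wv["good"]  # raises KeyError if "good" fell below min_count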

Once you have a per-word representation, there are several ways to combine the word vectors into a single sentence/document vector; see: How to get vector for a sentence from the word2vec of tokens in sentence
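One common option from that thread is to simply average the word vectors of each document. A rough sketch, reusing the w2v model above and assuming the question's train_data, test_data, y_train, and y_test are in scope; note that MultinomialNB expects non-negative counts, so a classifier such as LogisticRegression is a better fit for dense embeddings (this substitutes for the question's MultinomialNB):

import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, w2v):
    # Average the vectors of in-vocabulary tokens; zeros if none are known.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.mean(vecs, axis=0)

# Build fixed-length feature vectors for train and test with the same model,
# mirroring the fit-on-train / transform-on-test pattern from the question.
X_train = np.array([doc_vector(d.split(), w2v) for d in train_data])
X_test = np.array([doc_vector(d.split(), w2v) for d in test_data])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Accuracy " + str(clf.score(X_test, y_test)))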
