I have just started with the word2vec model and I want to build clusters from my questions data.
From what I gathered, to make the clusters I have to:

1. Create a word embedding model
2. Get the word vectors from the model
3. Create sentence vectors from the word vectors
4. Cluster the questions data using KMeans
So to get the word2vec word vectors, one of the articles says:
    def get_word2vec(tokenized_sentences):
        print("Getting word2vec model...")
        model = Word2Vec(tokenized_sentences, min_count=1)
        return model.wv
and then just create the sentence vectors and run KMeans.
The other article says that after getting the word2vec model I have to build the vocab and then train the model, and only then create the sentence vectors and run KMeans.
    def get_word2vec_model(tokenized_sentences):
        start_time = time.time()
        print("Getting word2vec model...")
        model = Word2Vec(tokenized_sentences, sg=1, window=window_size,
                         vector_size=size, min_count=min_count,
                         workers=workers, epochs=epochs, sample=0.01)
        log_total_time(start_time)
        return model

    def get_word2vec_model_vector(model):
        start_time = time.time()
        print("Training...")
        # model = Word2Vec(tokenized_sentences, min_count=1)
        model.build_vocab(sentences=shuffle_corpus(tokenized_sentences), update=True)
        # Training the model
        for i in tqdm(range(5)):
            model.train(sentences=shuffle_corpus(tokenized_sentences),
                        epochs=50, total_examples=model.corpus_count)
        log_total_time(start_time)
        return model.wv

    def shuffle_corpus(sentences):
        shuffled = list(sentences)
        random.shuffle(shuffled)
        return shuffled
And this is how my tokenized_sentences look:
8857 [, , , year, old]
11487 [, , birthday, canada, cant, share, job, friend]
20471 [, , chat, people, also, talk]
5877 [, , found]
Q1) The second approach gives the following error:
---> 54 model.build_vocab(sentences=shuffle_corpus(tokenized_sentences), update=True)
55 # Training the model
56 for i in tqdm(range(5)):
~\AppData\Local\Programs\Python\Python38\lib\site-packages\gensim\models\word2vec.py in build_vocab(self, corpus_iterable, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
477
478 """
--> 479 self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=1)
480 total_words, corpus_count = self.scan_vocab(
481 corpus_iterable=corpus_iterable, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\gensim\models\word2vec.py in _check_corpus_sanity(self, corpus_iterable, corpus_file, passes)
1484 """Checks whether the corpus parameters make sense."""
1485 if corpus_file is None and corpus_iterable is None:
-> 1486 raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")
1487 if corpus_file is not None and corpus_iterable is not None:
1488 raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")
TypeError: Either one of corpus_file or corpus_iterable value must be provided
and
Q2) Is it necessary to build the vocab and then train the model, or is creating the model the only thing I need to do?