-1

I have just started with the word2vec model, and I want to make different clusters from my questions data.

So to make the clusters, what I gathered is that I have to:

  1. Create a word embedding model
  2. Get the word vectors from the model
  3. Create sentence vectors from the word vectors
  4. Cluster the questions data using KMeans

So to get the word2vec word vectors, one of the articles says:

from gensim.models import Word2Vec

def get_word2vec(tokenized_sentences):
    print("Getting word2vec model...")
    model = Word2Vec(tokenized_sentences, min_count=1)
    return model.wv

and then just create the sentence vectors and run KMeans.
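As an aside, the "sentence vector" step is usually just the average of a sentence's word vectors. Here is a minimal sketch of that step plus clustering, using a made-up 2-D vector table and a bare-bones k-means (the `TOY_VECTORS` values and the `kmeans` helper are purely illustrative; in practice the vectors come from `model.wv` and you would typically use `sklearn.cluster.KMeans`):

```python
import numpy as np

# Toy word vectors; in practice these come from model.wv
TOY_VECTORS = {
    "year": np.array([1.0, 0.0]),
    "old":  np.array([0.9, 0.1]),
    "chat": np.array([0.0, 1.0]),
    "talk": np.array([0.1, 0.9]),
}

def sentence_vector(tokens, wv, dim=2):
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones k-means: returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

sentences = [["year", "old"], ["chat", "talk"], ["year", "chat"]]
X = np.stack([sentence_vector(s, TOY_VECTORS) for s in sentences])
labels = kmeans(X, k=2)
```

With these toy vectors, the "year old" and "chat talk" sentences land in different clusters, which is the behavior the real pipeline relies on at scale.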

The other article says that after getting the word2vec model, I have to build the vocab and then train the model, and only then create the sentence vectors and run KMeans.

import random
import time

from tqdm import tqdm
from gensim.models import Word2Vec

def get_word2vec_model(tokenized_sentences):
    start_time = time.time()
    print("Getting word2vec model...")
    # window_size, size, min_count, workers and epochs are defined elsewhere
    model = Word2Vec(tokenized_sentences, sg=1, window=window_size, vector_size=size,
                     min_count=min_count, workers=workers, epochs=epochs, sample=0.01)
    log_total_time(start_time)
    return model


def get_word2vec_model_vector(model):
    start_time = time.time()
    print("Training...")
    model.build_vocab(sentences=shuffle_corpus(tokenized_sentences), update=True)
    # Training the model
    for i in tqdm(range(5)):
        model.train(sentences=shuffle_corpus(tokenized_sentences), epochs=50, total_examples=model.corpus_count)
    log_total_time(start_time)
    return model.wv


def shuffle_corpus(sentences):
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

and this is what my tokenized_sentences looks like:

8857                                     [, , , year, old]
11487     [, , birthday, canada, cant, share, job, friend]
20471                       [, , chat, people, also, talk]
5877                                           [, , found]

Q1) The second approach gives the following error:

---> 54     model.build_vocab(sentences=shuffle_corpus(tokenized_sentences), update=True)
     55     # Training the model
     56     for i in tqdm(range(5)):

~\AppData\Local\Programs\Python\Python38\lib\site-packages\gensim\models\word2vec.py in build_vocab(self, corpus_iterable, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    477 
    478         """
--> 479         self._check_corpus_sanity(corpus_iterable=corpus_iterable, corpus_file=corpus_file, passes=1)
    480         total_words, corpus_count = self.scan_vocab(
    481             corpus_iterable=corpus_iterable, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)

~\AppData\Local\Programs\Python\Python38\lib\site-packages\gensim\models\word2vec.py in _check_corpus_sanity(self, corpus_iterable, corpus_file, passes)
   1484         """Checks whether the corpus parameters make sense."""
   1485         if corpus_file is None and corpus_iterable is None:
-> 1486             raise TypeError("Either one of corpus_file or corpus_iterable value must be provided")
   1487         if corpus_file is not None and corpus_iterable is not None:
   1488             raise TypeError("Both corpus_file and corpus_iterable must not be provided at the same time")

TypeError: Either one of corpus_file or corpus_iterable value must be provided

and

Q2) Is it necessary to build the vocab and then train on the data? Or is getting the model the only thing I need to do?

Sunil Garg

2 Answers

1

Instead of doing model.build_vocab(sentences=shuffle_corpus(tokenized_sentences), update=True), replace the sentences param name with corpus_iterable. If your iterable is working fine, you should be able to build the vocab easily as:

model.build_vocab(shuffle_corpus(tokenized_sentences), update=True)

or

model.build_vocab(corpus_iterable=shuffle_corpus(tokenized_sentences), update=True)

It requires a list of lists for training, so try to provide the data in that format. Also, try to clean your data: I don't think empty-string tokens are a good choice, though I haven't tried with those either. Everything else is the same. Just follow the official documentation on FastText training and that should keep you going. It works for Word2Vec too, but the FastText page has more explanation.
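A quick way to drop those empty tokens before training (plain Python; the helper name is my own):

```python
def drop_empty_tokens(tokenized_sentences):
    """Remove empty/whitespace-only tokens, then drop sentences left empty."""
    cleaned = [[tok for tok in sent if tok.strip()] for sent in tokenized_sentences]
    return [sent for sent in cleaned if sent]

rows = [["", "", "", "year", "old"], ["", "", "found"], ["", ""]]
print(drop_empty_tokens(rows))  # → [['year', 'old'], ['found']]
```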

NOTE: The example you followed is from an older version; that is why the sentences= param is giving errors.

Q.2: The model needs the vocab. It is obviously necessary to build the vocab; otherwise, how would the model know what a, the, book, reader and so on are? Every word needs a corresponding index, and that is what the vocab is for. If you are working with data that has many OOV words, try FastText.

FastText has the nice property that, by looking at astronomer and geology, it can give you an embedding for astrology even if it has not seen that word even once.

Deshwal
0

In recent versions of Gensim, the name sentences has been replaced. (It often misled people into thinking each text had to be a proper sentence, as opposed to just a list-of-tokens.)

You should either specify your corpus as a corpus_iterable (if it's something like a Python list, or sequence that is re-iterable), or as a corpus_file (if it's in a single disk file that's already broken into texts by newline, and tokens by spaces).

Separately:

  • You probably don't need the complication of re-shuffling your corpus repeatedly. (If your data source has massive clumping of word-types in certain ranges – like all examples of word A occur in a run of texts early on, and all examples of word B appear in a run of texts late – then one shuffle before starting may help, so all words are equally likely to be found early and late in the corpus.)
  • Calling .train() multiple times is almost always a mistake which causes confusion over the training passes & mismanagement of the learning-rate alpha decay. See this answer about (related-algorithm) Doc2Vec for more details: https://stackoverflow.com/a/62801053/130288
gojomo