
I have a file from the Amazon reviews dataset: meta_Electronics.json.gz

The code below was given by the instructor:

def read_product_description(fname):
    '''
    Load all product descriptions
    Args: 
        fname: dataset file path
    Returns:
        dict: key is asin, value is description content
    '''
    result = {}
    for i in parse(fname):
        try:
            if "Camera & Photo" in i["categories"][0]:
                result[i["asin"]]=i["description"]
        except:
            continue
    return result

I think the above code filters product descriptions in the Camera & Photo category.
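
(parse isn't shown here; it's just the loader for the gzipped file. Roughly, something like this, assuming one record per line:)

import gzip
import ast

def parse(path):
    # Rough stand-in for the loader helper (the instructor's parse() may differ):
    # the Amazon metadata files store one Python-literal dict per line,
    # hence ast.literal_eval rather than json.loads.
    with gzip.open(path, 'rt') as g:
        for line in g:
            yield ast.literal_eval(line)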

class TaggedDescriptionDocument(object):
    '''
    This class saves all product and description information in its dictionary and generates an iterator of TaggedDocuments,
        which can be used for a Doc2Vec model
    '''
    def __init__(self, descriptondict):
        self.descriptondict = descriptondict
        

    def __iter__(self):
        for asin in self.descriptondict:
            for content in self.descriptondict[asin]:
                yield TaggedDocument(clean_line(content), [asin])

Note: clean_line just cleans every single line in the content, removes punctuation, etc.
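
A simplified stand-in for it would be roughly:

import string

def clean_line(line):
    # Simplified sketch of the course's clean_line: lowercase, strip punctuation,
    # and split into a list of word tokens (TaggedDocument expects a list of
    # words, not a raw string).
    line = line.lower().translate(str.maketrans('', '', string.punctuation))
    return line.split()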

description_dict = read_product_description("meta_Electronics.json.gz")
des_documents = TaggedDescriptionDocument(description_dict)

After the above two lines, I think this creates the TaggedDocument corpus for the Doc2Vec model. However, when I try to train a Doc2Vec model, it shows:

model_d = Doc2Vec(des_documents, vector_size=100, window=15, min_count=0, max_vocab_size=1000)

RuntimeError: you must first build vocabulary before training the model

The min_count is already 0. Is there anything wrong with the code? Any help will be appreciated!

Alex

1 Answer


The you must first build vocabulary error suggests something, such as a buggy corpus, prevented any vocabulary from being discovered.

Are you sure des_documents contains what you intended it to?

For example:

  • If you execute sum(1 for _ in des_documents) repeatedly, does it report the same count of documents you expect?
  • Does looking at the 1st item returned by the iterable sequence – next(iter(des_documents)) – show a valid TaggedDocument object with sensible words and tags? (See the quick check just after this list.)
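
Concretely, a quick sanity check (assuming des_documents is the iterable built above) would be:

# A restartable corpus should report the same non-zero count every time.
print(sum(1 for _ in des_documents))
print(sum(1 for _ in des_documents))

# The first item should be a TaggedDocument with word tokens and the asin as the tag.
first = next(iter(des_documents))
print(first.words[:20], first.tags)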

You should also try enabling logging at the INFO level, and try all steps again, watching the logged output carefully for any hints something is going wrong. (Do steps take a reasonable amount of time, & report counts of discovered/surviving words that make sense?)
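
For example, before running the other steps:

import logging

# Plain Python logging at the INFO level; gensim then reports the progress of
# the vocabulary scan and training to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)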

max_vocab_size=1000 is almost certainly an unhelpful setting. It doesn't cap the final surviving vocabulary - it causes the initial vocabulary-scan to never remember more than 1000 words. And further, to ruthlessly enforce that cap in a crude but low-overhead way, every time it hits the cap, it discards all words with fewer occurrences than an ever-escalating floor.

This setting was only intended as a crude way to prevent vocabulary discovery from exhausting all RAM, and if used at all, should be set to some value far, far larger than whatever vocabulary size you desire or expect. So: your atypically-tiny value of 1000, together with any amount of data sufficient for an algorithm like Doc2Vec (lots and lots of varied words) could be contributing to your problem.

With any dataset you've already got loaded in memory, it's unlikely to be a needed setting at all.

Separately, min_count=0 is almost always a bad setting for these algorithms, which only effectively model words with many contrasting usage examples. Throwing out words that only appear a few times usually improves the overall quality of the surviving learned vectors – hence the default min_count=5.
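
Putting these together, something closer to the defaults, with the steps split out so it's easier to see where anything goes wrong, would be (a sketch, not tuned values):

model_d = Doc2Vec(vector_size=100, window=15, min_count=5)  # no max_vocab_size cap
model_d.build_vocab(des_documents)   # the vocabulary scan happens here
print(len(model_d.wv))               # gensim 4.x; use len(model_d.wv.vocab) in 3.x
model_d.train(des_documents, total_examples=model_d.corpus_count,
              epochs=model_d.epochs)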

gojomo