
I am building a classification model for a dataset of items. Basically, I have 2 columns, for example:

Item name            category
unsalted butter      dairy and eggs
cheese               dry grocery
peanut butter cream  dry grocery

I did the required preprocessing to clean the item name, which is my input, and one-hot encoded the category, which is my target output. I want to use the KNN algorithm to classify the item names, so I have to convert the item names to numbers.

I am struggling with the conversion model: I am not able to build the right model or check the word2vec accuracy results.

Would you please help me with this, since I am a beginner with word-embedding techniques?

I tried the following:

import gensim

def tagged_document(text):
    for i, sent in enumerate(text):
        for j, word in enumerate(sent.split()):
            yield gensim.models.doc2vec.TaggedDocument(word, [j])

data_for_training = list(tagged_document(df['item_name']))
print(data_for_training[3])

Output: [TaggedDocument(words='peanut', tags=[0]), TaggedDocument(words='butter', tags=[1]), TaggedDocument(words='cream', tags=[2])]

model = gensim.models.doc2vec.Doc2Vec(size=150, window=4, min_count=2, workers=10, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
model.save('model.bin')

print(model)
print(list(model.wv.vocab))

Output:

Doc2Vec(dm/m,d150,n5,w4,mc2,s0.001,t10) ['u', 'n', 's', 'a', 'l', 't', 'e', 'd', 'b', 'r', 'c', 'm', 'o', 'k', 'x', 'g', 'p', 'i', 'f', 'h', 'y', 'w', 'v', 'z', 'j', 'q', '7', '2', 'ü', '\x95', 'ñ', '1', '±', 'ç', '5', '4', '0', 'ã', 'ä', 'ù', 'ø', '8', '6', '²', '\x8a', 'ª', '\x82', '\x84', 'ð', '\x9f', '¥', '\x96', '§', '3', '\x91', '¯', '¬', '\xad', '¨', 'â', '\x80', '\x99', 'ï', '¿', '½', '\x93', '9', '©', '¢', '\x97', '\x94', '·', '\x88', '\x8d', '\x83', '\x98', '\x90', '®', 'å', 'é', '\x9d', 'æ', '¡', '¹', '´', '\x8c', '°', '¼', '\x87']

1 Answer


First and foremost, the `words` part of a `TaggedDocument` should be a list of words. If you provide only a single string, Python will treat it as a sequence of single-character 'words'.

So when you supply...

TaggedDocument(tags=[0], words='peanut')

...that's equivalent to...

TaggedDocument(tags=[0], words=['p', 'e', 'a', 'n', 'u', 't'])

That's why your final model has only single-character 'words' in it.

If in fact later you want to look-up Doc2Vec document-vectors by the 'Item name' values as look-up keys, you'll want to be sure your code instead creates TaggedDocuments more like:

TaggedDocument(tags=['unsalted butter'], words=['dairy', 'and', 'eggs'])

On the other hand, if you want to look-up vectors by 'category' values as look-up keys, then you'll need the categories to be the tags:

TaggedDocument(tags=['dairy and eggs'], words=['unsalted', 'butter'])

Which choice is right really depends on what you're trying to achieve: what data is supposed to help you classify into which bins?

And it's not clear Doc2Vec will be helpful here, given the data you've shown and the task you've described (classification).

Doc2Vec helps turn texts of many words into shorter summary vectors. It's usually demonstrated on texts that are at least as long as sentences, but possibly paragraphs, articles, or even full books. With single words, or short phrases of just a few words, it will have a much harder time learning/providing meaningful vectors.

Do you already have a classifier of any type, even a poorly-performing one, working on this same data using simpler techniques, such as the "bag-of-words" representations available through Scikit-Learn classes like CountVectorizer?

If not, I suggest doing that first, to achieve actual classification on a simpler and more typical base.
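For instance, a minimal baseline along those lines, a sketch assuming scikit-learn and a few made-up rows in the spirit of the question's data, could be:

```python
# A minimal bag-of-words baseline: CountVectorizer turns each item name into
# sparse character-n-gram counts, and KNN classifies on those counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy rows in the spirit of the question's data (made up for illustration).
item_names = ['unsalted butter', 'cheese', 'peanut butter cream',
              'salted butter', 'cheddar cheese', 'almond butter spread']
categories = ['dairy and eggs', 'dry grocery', 'dry grocery',
              'dairy and eggs', 'dry grocery', 'dry grocery']

clf = make_pipeline(
    # Character n-grams within word boundaries suit very short product names.
    CountVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    KNeighborsClassifier(n_neighbors=3),
)
clf.fit(item_names, categories)
print(clf.predict(['unsalted butter stick']))
```

The whole pipeline accepts raw strings, so there's no separate "conversion model" to maintain.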

Only with that baseline in place should you consider using features derived from Word2Vec or Doc2Vec, to see if they help. Unless you have longer multi-word product descriptions, they might not.

gojomo
  • Got your point, thank you! Since I am trying to classify product names into categories, as you mentioned, Word2Vec could be helpful in my case. But would you guide me on how to solve the problem I am facing in word2vec with the TaggedDocuments? No, I have not used any classifier yet; I am planning to extract these vectors and fit them in a KNN model to do the classification. What do you think? @gojomo – Mayar Alzerki Dec 15 '22 at 09:23
  • Simply supplying an actual list-of-words, not a single word or a single string, as mentioned in my answer, should resolve the immediate problem. But I'd again urge getting a complete classification system working without any `Word2Vec`/`Doc2Vec` elements first, especially using bag-of-words (or, for very short phrases/product names, also bag-of-character-ngrams) representations. Only with that working should you consider mixing in Word2Vec/Doc2Vec-based features, to see if they might help, keeping in mind you may not have the bulk/kind of natural language ideal for training. – gojomo Dec 15 '22 at 19:48