I'm trying to train a Doc2Vec model in order to create a multi-label text classifier.
To do that, I have chosen a dataset that contains approximately 70,000 articles, each containing between 1,500 and 2,000 words.
These articles are divided into 5 classes.
While setting up my input, I chose each document's corresponding label as its tag. I did it as follows:

tagged_article = data.apply(lambda r: TaggedDocument(words=r['article'].split(), tags=[r.labels]), axis=1)

Then I trained my model with the following lines of code:

model_dbow = Doc2Vec(dm=1, vector_size=300, negative=5, min_count=10, workers=cores)
model_dbow.build_vocab([x for x in tqdm(tagged_article.values)])

print("Training the Doc2Vec model for ", no_epochs, "number of epochs" )
for epoch in range(no_epochs):
    model_dbow.train(utils.shuffle([x for x in tqdm(tagged_article.values)]), total_examples=len(tagged_article.values), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

After that I have created a logistic regression model in order to predict tags for every article.

To do that, I have created the following function:

def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=inference_steps)) for doc in tqdm(sents)])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, tagged_article)

logreg = LogisticRegression(solver='lbfgs',max_iter=1000)
logreg.fit(X_train, y_train)

Unfortunately, I am getting very bad results: a 22% accuracy rate and a 21% F1 score.

Can you please explain why I am getting these bad results?

1 Answer

First & foremost, you almost certainly don't want to use your own loop to call train() multiple times while managing alpha yourself. See: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?

As you don't show your no_epochs value, I can't be sure you're doing the absolute worst thing - eventually decrementing alpha to a negative value – but you might be. Still, there's no need for that error-prone loop. (And, you may want to contact whatever source suggested this code template to you and let them know they are promoting an anti-pattern.)

It is probably also a mistake to use your just 5 known-labels as the document-tags. That means the model is essentially only learning 5 doc-vectors, as if all articles were just fragments of 5 giant texts. While it's sometimes helpful to use (or add) known-labels as tags, the more classic manner of training Doc2Vec gives each document a unique ID, so the model is learning (in your case) about 70,000 distinct doc-vectors, and may more richly model the document-possibility spaces spanned, in various irregular shapes, by all your documents and labels.

While your data is certainly of a size comparable to published work that shows the value of the Doc2Vec algorithm, your corpus isn't gigantic (and it's unclear how large & diverse your vocabulary might be). So it's possible that 300 dimensions is oversized for the quantity/variety of data you have, or min_count=10 too aggressive (or not aggressive enough) in trimming less-important & less-well-sampled words.

Finally, note that the Doc2Vec class will inherit a default epochs value of 5, but most published work uses 10-20 training epochs, and often with smaller datasets even more can be helpful. Additionally, inference will reuse the same epochs set (or defaulted) at model creation, and works best with (at least) the same number of epochs as training - while it's unclear what inference_steps you're using.

(As a separate matter of code legibility: you've named your model model_dbow, but by using dm=1 you're actually using PV-DM mode, not PV-DBOW mode.)

gojomo
  • In fact I have used the default value of alpha, so I guess it will become negative after some epochs. I will try to apply all your remarks and I hope it gets better – firas_frikha Nov 17 '20 at 19:26
  • Just one last question: can I use, for example, as the tag for each document a list that contains two values, such as the document_id and its label? – firas_frikha Nov 17 '20 at 19:33
  • Yes, the `tags` must be a list, but in the classic case it's a list with just one element, a unique document ID. But you can add another tag, such as a known label. Whether that helps or hurts the overall model is something you'd have to test. Having 2 tags will essentially double training effort in true DBOW mode. (If your hope in providing these is to get a single summary doc-vector for each label, note that such reduction of complex category-shapes in the vector space to just a single point may gloss-over the actual ranges of varied document-themes within the category.) – gojomo Nov 17 '20 at 20:21