I'm trying to train a Doc2Vec model in order to create a multi-label text classifier.
To do that, I chose a dataset of approximately 70,000 articles, each containing between 1,500 and 2,000 words.
These articles are divided into 5 classes.
While setting up my input, I used each document's corresponding label as its tag.
I did it as follows:
tagged_article = data.apply(lambda r: TaggedDocument(words=r['article'].split(), tags=[r.labels]), axis=1)
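For illustration, here is a minimal sketch of what this tagging step produces, using plain (words, tags) tuples in place of gensim's TaggedDocument (which is itself a namedtuple with those two fields); the toy articles and labels are made up:

```python
# Toy stand-in for the DataFrame rows: (article text, label) pairs.
articles = [
    ("the cat sat on the mat", "pets"),
    ("stocks fell sharply today", "finance"),
]

# Mirror of the apply/lambda above: split the text into words and
# wrap the single label in a list, as TaggedDocument's tags field expects.
tagged = [(text.split(), [label]) for text, label in articles]

print(tagged[0])  # (['the', 'cat', 'sat', 'on', 'the', 'mat'], ['pets'])
```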
Then I trained my model with the following lines of code:
model_dbow = Doc2Vec(dm=1, vector_size=300, negative=5, min_count=10, workers=cores)  # note: dm=1 selects PV-DM, not DBOW
model_dbow.build_vocab([x for x in tqdm(tagged_article.values)])

print("Training the Doc2Vec model for", no_epochs, "epochs")
for epoch in range(no_epochs):
    model_dbow.train(utils.shuffle([x for x in tqdm(tagged_article.values)]),
                     total_examples=len(tagged_article.values), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha
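For context on what this loop does to the learning rate, here is a small sketch of the alpha schedule it produces, assuming gensim's default starting alpha of 0.025 and 30 epochs (no_epochs is not shown above, so 30 is purely an illustrative guess):

```python
# Reproduce the alpha bookkeeping from the training loop, with plain floats.
alpha, decay, no_epochs = 0.025, 0.002, 30  # 0.025 is gensim's default alpha

schedule = []
for _ in range(no_epochs):
    schedule.append(alpha)  # rate used for this epoch's train() call
    alpha -= decay          # same update as model_dbow.alpha -= 0.002

print(round(schedule[12], 6))  # 0.001 (last positive rate)
print(schedule[13] < 0)        # True: later epochs train with a negative rate
```

Under these assumptions, more than half of the epochs would train with a zero or negative learning rate, which may itself degrade the model.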
After that, I created a logistic regression model to predict the tag of each article.
To do that, I created the following function:
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=inference_steps))
                                for doc in tqdm(sents)])
    return targets, regressors
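To make the zip(*[...]) unpacking pattern in vec_for_learning concrete, here is a minimal sketch with a stand-in for model.infer_vector (a one-dimensional word-count "embedding", purely for illustration; the documents and labels are made up):

```python
from collections import namedtuple

# Same two-field shape as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def fake_infer_vector(words):
    # Stand-in for model.infer_vector: a 1-D feature vector (word count).
    return [float(len(words))]

docs = [TaggedDoc(["stocks", "fell"], ["finance"]),
        TaggedDoc(["cute", "cat", "video"], ["pets"])]

# zip(*[...]) splits the list of (tag, vector) pairs into two parallel tuples.
targets, regressors = zip(*[(d.tags[0], fake_infer_vector(d.words)) for d in docs])

print(targets)     # ('finance', 'pets')
print(regressors)  # ([2.0], [3.0])
```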
y_train, X_train = vec_for_learning(model_dbow, tagged_article)
logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
Unfortunately, I am getting very poor results: 22% accuracy and a 21% F1 score.
Can you please explain why I am getting these results?