
I've got a dataset of job postings with about 40,000 records. I extracted skills from the descriptions using NER, with a dictionary of about 30,000 skills. Every skill is represented as a unique identifier.

The distribution of the number of skills per posting looks like this:

mean 15.12 | std 11.22 | min 1.00 | 25% 7.00 | 50% 13.00 | 75% 20.00

I've trained a word2vec model using only the skill ids, and it works reasonably well: I can find the most similar skills to a given one, and the results look okay.
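For context, a minimal sketch of that word2vec step (it assumes the same `job_data` dataframe as the code further down; the parameter values here are illustrative, not my exact setup):

import gensim

# one "sentence" of skill ids per posting
sentences = [skills.split() for skills in job_data['skills']]

# note: size= is the gensim 3.x name; it became vector_size= in gensim 4+
model_w2v = gensim.models.Word2Vec(sentences, size=50, min_count=2)

# nearest skills to a given skill id
model_w2v.wv.most_similar('48', topn=10)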

But when it comes to a doc2vec model, I'm not satisfied with the results.

I have about 3,200 unique job titles; most of them have only a few entries, and quite a few of them are from the same field ('front end developer', 'senior javascript developer', 'front end engineer'). I deliberately went for a variety of job titles, which I use as tags in doc2vec.TaggedDocument(). My goal is to see a number of relevant job titles when I input a vector of skills into docvecs.most_similar().

After training a model (I've tried different numbers of epochs (100, 500, 1000) and vector sizes (40 and 100)), it sometimes works correctly, but most of the time it doesn't. For example, for a skill set like [numpy, postgresql, pandas, xgboost, python, pytorch], the most similar job title I get has a skill set like [family court, acting, advising, social work].

Could the problem be the size of my dataset? Or the size of the docs (I consider them short texts)? I also suspect I misunderstand something about the doc2vec mechanism and am simply overlooking it. I'd also like to ask whether you know of any other, perhaps more advanced, ideas for getting relevant job titles from a skill set, and for comparing two skill-set vectors to tell whether they are close or far apart.

UPD:

Job titles from my data are 'tags' and skills are 'words'. Each text has a single tag. There are 40,000 documents with 3,200 repeating tags; 7,881 unique skill ids appear in the documents. The average number of skill words per document is 15.

My data example:

         job_titles                                             skills
1  business manager                 12 13 873 4811 482 2384 48 293 48
2    java developer      48 2838 291 37 484 192 92 485 17 23 299 23...
3    data scientist      383 48 587 475 2394 5716 293 585 1923 494 3

An example of my code:

import gensim

def tagged_document(df):
    # turn each row into a TaggedDocument: skills are the words, the job title is the single tag
    for index, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(row['skills'].split(), [row['job_titles']])


data_for_training = list(tagged_document(job_data[['job_titles', 'skills']]))

model_d2v = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)

# build the vocabulary before training (train() fails without it)
model_d2v.build_vocab(data_for_training)

model_d2v.train(data_for_training, total_examples=model_d2v.corpus_count, epochs=model_d2v.epochs)

# the skill set contains related skills which represent a front end developer
skillset_ids = '12 34 556 453 1934'.split()
new_vector = model_d2v.infer_vector(skillset_ids, epochs=100)
model_d2v.docvecs.most_similar(positive=[new_vector], topn=30)
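For the pairwise comparison I asked about above, one option would be to infer a vector for each skill set and take their cosine similarity (a sketch; the second id list is just taken from my data example):

import numpy as np

vec_a = model_d2v.infer_vector('12 34 556 453 1934'.split(), epochs=100)
vec_b = model_d2v.infer_vector('48 2838 291 37'.split(), epochs=100)

# cosine similarity in [-1, 1]; values near 1 mean the two skill sets are close
cos_sim = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))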

I've been experimenting recently and noticed that it performs a little better if I filter out documents with fewer than 10 skills. Still, some irrelevant job titles come up.

Niko D

1 Answer


Without seeing your code (or at least a sketch of its major choices), it's hard to tell if you might be making shooting-self-in-foot mistakes, like perhaps the common "managing alpha myself by following crummy online examples" issue: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
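In code, that mistake usually looks something like this (a sketch of the anti-pattern, not taken from your code; `model` and `docs` stand in for any Doc2Vec model and corpus):

# ANTI-PATTERN from many online examples - don't do this:
# for epoch in range(100):
#     model.train(docs, total_examples=model.corpus_count, epochs=1)
#     model.alpha -= 0.002           # manual learning-rate decay
#     model.min_alpha = model.alpha  # usually mis-managed

# Instead: one train() call, letting gensim manage alpha across all epochs
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)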

(That your smallest number of tested epochs is 100 seems suspicious; 10-20 epochs are common values in published work when both the size of the dataset and the size of each doc are plentiful, though more passes can sometimes help with thinner data.)

Similarly, it's not completely clear from your description what your training docs are like. For example:

  • Are the tags titles and the words skills?
  • Does each text have a single tag?
  • If there are 3,200 unique tags and 30,000 unique words, is that just 3,200 TaggedDocuments, or more with repeating titles?
  • What's the average number of skill-words per TaggedDocument?

Also, if you are using word-vectors (for skills) as query vectors, you have to be sure to use a training mode that actually trains those. Some Doc2Vec modes, such as plain PV-DBOW (dm=0), don't train word-vectors at all; they will still exist, but as randomly-initialized junk. (Either adding the non-default dbow_words=1 to add skip-gram word-training, or switching to the PV-DM dm=1 mode, will ensure the word-vectors are co-trained and in a comparable coordinate space.)
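Concretely, either of these configurations trains word-vectors alongside the doc-vectors (a sketch; the other parameter values are illustrative):

import gensim

# PV-DBOW doc-vectors plus interleaved skip-gram word-vector training
model = gensim.models.doc2vec.Doc2Vec(dm=0, dbow_words=1, vector_size=50, min_count=2, epochs=20)

# ...or PV-DM, which co-trains word-vectors as part of its context averaging
model = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=50, min_count=2, epochs=20)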

gojomo
  • I think he tries to compare a set of words to documents. Is that even a good comparison? Should you compare documents with documents, rather than just a collection of words? – Borut Flis Oct 02 '20 at 21:46
  • Some modes of `Doc2Vec` train doc- and word- vectors into the same coordinates, and this can lead to useful/interpretable results – see eg https://arxiv.org/abs/1507.07998. – gojomo Oct 02 '20 at 22:34
  • Hi gojomo, many thanks for your reply! I've added the info you asked in my initial question. – Niko D Oct 03 '20 at 12:03
  • I don't see anything obviously wrong with your setup, but I would sample some of the items in `data_for_training`, and look over `model_d2v.docvec.doctags` and `model_d2v.wv.index2word` to double-check what you think is being trained is. Your data is on the smaller size but may be enough to also try larger `vector_size` values. If skills order is irrelevant, you could also try a very-large `window` (eg `window=1000000`) to essentially make all skills equal neighbors of each other, in modes using `window`, rather than more-influenced by the closest. – gojomo Oct 03 '20 at 13:29
  • When you're not actually using the word-vectors, you could try plain DBOW model (`dm=0`) - it often works very well with short docs, and will still work with your "infer-then-check-doc-neighbors" approach. Alternatively, if using default `dm=1` or `dm=0, dbow_words=1`, you could try looking up title neighbors of skills (`model_dv.docvecs.most_similar(positive=[model_dv.wv['12']])`) or vice-versa (`model_dv.wv.most_similar(positive=[model_dv.docvecs['java developer']])`); both are written out after this comment thread. For non-natural-language data, non-default `ns_exponent` values may also help. – gojomo Oct 03 '20 at 13:34
  • Thank you very much for your recommendations, gojomo! I think that dm=0, dbow_words=1 have been a game changer. The result is much better at the moment. – Niko D Oct 06 '20 at 09:53
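The cross-lookup queries from the comments, written out as a runnable sketch (using the question's `model_d2v`; the skill id '12' and the tag 'java developer' are stand-ins that must exist in the trained model):

# job titles whose doc-vectors are nearest to one skill's word-vector
model_d2v.docvecs.most_similar(positive=[model_d2v.wv['12']], topn=10)

# skills whose word-vectors are nearest to one job title's doc-vector
model_d2v.wv.most_similar(positive=[model_d2v.docvecs['java developer']], topn=10)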