I have two separate data sets, one of resumes and one of demands. Using gensim's Doc2Vec, I trained a model for each and can query for similar words within each data set. Now I need to merge the two models into one, query the demands for a given resume, and obtain the similarity or matching between them. My data sets are plain txt files in which consecutive resumes or demands are separated by * . My implementation is below; any suggestions would be highly appreciated. Thanks.

import gensim


def read_corpus(fname, tokens_only=False):
    """Yield one tokenized document per '&&'-separated chunk of the file."""
    with open(fname) as f:
        i = 0
        for line in f.read().split('&&'):
            if len(line) > 1:
                if tokens_only:
                    yield gensim.utils.simple_preprocess(line)
                else:
                    # For training data, add tags
                    yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
                i += 1


# Raw string so the backslash in the Windows path is taken literally
vocabulary = read_corpus(r'D:\Demand.txt')
train_corpus = list(vocabulary)
print(train_corpus[:2])

# vector_size and epochs are the current names of the older size and iter parameters
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
print(model.infer_vector(['trainings', 'certifications', 'analyst', 'unix', 'jdbc', 'testing']))
# Documents whose vectors are closest to the inferred vector
print(model.docvecs.most_similar(positive=[model.infer_vector(['spark', 'sqoop'])]))
# Word-level similarity lives on the word vectors (model.wv)
print(model.wv.most_similar('unix'))
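
One possible route, sketched here under assumptions rather than as a confirmed solution: instead of merging two trained models, train a single Doc2Vec model on both files and give every document a tag that records which data set it came from. The function names and the `resume`/`demand` prefixes below are hypothetical, and the `&&` separator is taken from the code above.

```python
def label_documents(raw_text, prefix, separator="&&"):
    """Split raw text into documents and pair each with a tag like 'resume_0'."""
    pieces = [p.strip() for p in raw_text.split(separator) if p.strip()]
    return [(f"{prefix}_{i}", piece) for i, piece in enumerate(pieces)]


def train_joint_model(labelled_docs, vector_size=50, min_count=2, epochs=55):
    """Train one Doc2Vec model on (tag, text) pairs from both data sets."""
    # Imported here so label_documents stays usable without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess

    corpus = [TaggedDocument(simple_preprocess(text), [tag])
              for tag, text in labelled_docs]
    model = Doc2Vec(vector_size=vector_size, min_count=min_count, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model


def rank_demands(model, resume_tag, demand_tags):
    """Return (demand_tag, cosine similarity) pairs, best match first."""
    # gensim 4 exposes document vectors as model.dv, gensim 3 as model.docvecs.
    docvecs = model.dv if hasattr(model, "dv") else model.docvecs
    scores = [(tag, float(docvecs.similarity(resume_tag, tag)))
              for tag in demand_tags]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With gensim installed, the labelled lists from the resume file and the demand file would be concatenated, passed to `train_joint_model`, and then `rank_demands(model, "resume_0", demand_tags)` would order the demands for the first resume. Because both sets share one vector space, the scores are directly comparable, which is not the case for vectors taken from two separately trained models.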
  • Can you give samples for each data set? – WolfgangK Apr 13 '18 at 07:04
  • Hi @WolfgangK, thanks for considering the query. Sure, please find the two files as edited in the question – krits Apr 13 '18 at 08:53
  • @WolfgangK, due to the size I cannot post a sample file, and with a reduced file the matching would not give results. For a data set you can take any text files; the * separator is not required. Any two text files can be used and the matching needs to be done between them, for instance two separate text documents about sports or any other news topic, to find the similarity between them and check for the better writer – krits Apr 13 '18 at 09:10
  • It is still unclear to me what you are trying to do. As a reproducible input you could use two subsets of the 20newsgroups dataset, available e.g. through the `sklearn` module. Here is the [link to the documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). Furthermore, it would be helpful to know what your code is supposed to do exactly. Where does it fail, and what should your expected output look like? – WolfgangK Apr 14 '18 at 10:28
  • As I said, I am trying to get the matching between two documents. My code takes a single document, and I can get the similar words or matching within that single document, but I want to add functionality that lets me give two documents as input and then find the similar words and matching between them. You can run the code with a txt file as input, such as the 20newsgroups data you mentioned – krits Apr 16 '18 at 07:53
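
The "two documents in, one matching score out" goal described in the last comment could be sketched as follows, assuming a Doc2Vec model has already been trained as in the question. Both function names are hypothetical. The idea is to infer a vector for each text with the same model and compare the two vectors by cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return float(dot / (norm_u * norm_v))


def document_similarity(model, text_a, text_b):
    """Infer a vector for each text with the same trained model and compare them."""
    # Imported here so cosine_similarity stays usable without gensim installed.
    from gensim.utils import simple_preprocess

    vec_a = model.infer_vector(simple_preprocess(text_a))
    vec_b = model.infer_vector(simple_preprocess(text_b))
    return cosine_similarity(vec_a, vec_b)
```

Note that `infer_vector` is randomized, so the score for the same pair of texts can vary somewhat from one call to the next, and it is only meaningful when both texts are passed through the same model.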

0 Answers