
I have been given a gensim doc2vec model that was trained on 20 million documents. I have also been given those 20 million documents, but I have no idea how, or in which order, the documents in the folder were fed to training. I am supposed to use the test data to find the top 10 matches from the training set. The code I use is:

import gensim

model = gensim.models.doc2vec.Doc2Vec.load("doc2vec_sample.model")

test_docs=["This is the test set I want to test on."]

def read_corpus(documents, tokens_only=False):
    # enumerate() gives each document its own running index; a counter
    # incremented once outside the loop would tag every document with 1
    for count, line in enumerate(documents):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(tokens, [count])


test_corpus = list(read_corpus(test_docs, tokens_only=True))

doc_id=0

inferred_vector = model.infer_vector(test_corpus[doc_id])
maxx=10
sims = model.docvecs.most_similar([inferred_vector], topn=maxx)

for match in sims:
    print(match)

The output I get is:

(1913, 0.4589531719684601)
(3250, 0.4300411343574524)
(1741, 0.42669129371643066)
(1, 0.4023148715496063)
(1740, 0.3929900527000427)
(1509, 0.39229822158813477)
(3189, 0.387174129486084)
(3145, 0.3842133581638336)
(1707, 0.3813004493713379)
(3200, 0.3754497170448303)

How do I find out which document the ID "1913" refers to? How can I access the documents in the training set from these 10 job IDs?

User54211
  • `documents[i]`, wouldn't it be? – cs95 Nov 20 '17 at 06:30
  • documents[i] would refer to the training document, I need the data in the test document. – User54211 Nov 20 '17 at 06:50
  • @User54211 stuck at the same issue. Found any solution..? – Quamber Ali Nov 21 '17 at 14:11
  • @NSQuamber.java stuck at the same issue. The only thing I found is that when the training set is created, documents are tagged in sequence, so a document's position in the training set matches its ID here. However, this doesn't help in my case since I have no idea how the training was done. – User54211 Nov 22 '17 at 08:55

2 Answers


The best approach is to ask the person who trained the model how they assigned IDs ('tags' in Doc2Vec parlance) to documents.

If that's not available, look at the training corpus to see if there's any natural naming or ordering that applies to the documents. (Are they one per file? Then perhaps the filenames in sorted order map to ascending IDs. Is each document a line in a single file? Then perhaps the line number is the ID-tag.)
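As a rough sketch of those two hypotheses, here is one way to build a candidate ID-to-document mapping with only the standard library. The function names and the 0-based starting ID are my own assumptions for illustration; nothing here is known about how the model was actually trained.

```python
import os


def tags_from_sorted_filenames(folder, start=0):
    """Hypothesis A: one document per file, IDs follow sorted filename order."""
    names = sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )
    return {start + i: os.path.join(folder, name) for i, name in enumerate(names)}


def tags_from_line_numbers(path, wanted_tags, start=0):
    """Hypothesis B: one document per line, ID is the line number.

    Reads the file in a single pass and keeps only the requested lines,
    so the whole corpus never has to fit in memory.
    """
    wanted = set(wanted_tags)
    found = {}
    with open(path, encoding="utf-8") as handle:
        for tag, line in enumerate(handle, start=start):
            if tag in wanted:
                found[tag] = line.rstrip("\n")
                if len(found) == len(wanted):
                    break  # stop early once every requested line is found
    return found
```

Note that plain `sorted()` is lexicographic, so "10.txt" comes before "2.txt"; whoever trained the model may have used a different ordering, and IDs may have started at 1 rather than 0, so treat each variant as a separate hypothesis to test.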

When you have a theory, if the model was a usefully-trained model, then you can test it by seeing if the most_similar() results make sense with that ID-tag interpretation.

You could do this in an ad-hoc fashion: do the results for random probes of query-documents look good to you?

Or you could try to formalize it, for example by re-inferring vectors for documents that were known to be in the training set, then looking for the most-similar documents to those vectors. If the model is good and if the inference is working well (which might require tweaking the infer_vector() parameters), then either the "top hit" for a vector, or one of the top hits, should be for the exact same document.
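A minimal sketch of that self-match check, assuming you already have a hypothesised mapping from ID-tag to tokenized document. The helper name `self_match_rate` is hypothetical; it only relies on `infer_vector()` and `most_similar()`, which the question's code already uses.

```python
def self_match_rate(model, docs_by_tag, topn=10, **infer_kwargs):
    """Fraction of sampled training documents whose hypothesised tag
    appears in the top-n most_similar() results for their own
    re-inferred vector.

    docs_by_tag maps hypothesised tag -> list of tokens.
    """
    # Newer gensim releases expose document vectors as model.dv;
    # older ones (as in the question) use model.docvecs.
    doc_vectors = model.dv if hasattr(model, "dv") else model.docvecs
    hits = 0
    for tag, tokens in docs_by_tag.items():
        vector = model.infer_vector(tokens, **infer_kwargs)
        neighbours = doc_vectors.most_similar([vector], topn=topn)
        if any(found == tag for found, _ in neighbours):
            hits += 1
    return hits / len(docs_by_tag) if docs_by_tag else 0.0
```

Run it on a sample of a few hundred documents per hypothesis. A rate close to 1 would support that ID mapping, while a rate near 0 suggests either the mapping is wrong or inference needs tuning. Extra keyword arguments are passed straight to `infer_vector()`; as far as I know the name of the passes parameter differs between gensim versions (`steps` in older releases, `epochs` in newer ones), so check the version you have.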

But really, if the model is so poorly documented you can't correlate the documents to the IDs, and the original person isn't available, you may want to throw it out and re-train a model with better-documented procedures.

gojomo

Load the training documents into a list, in the same order they were fed to the model, and look up the ID in that list. Of course, you don't want to do print(documents) and get 20 million entries on your screen. It may be more efficient to insert the list of documents into a database table. Each entry of the documents list (i.e., train_corpus from the gensim doc2vec tutorial) has the following format: TaggedDocument(words=['token1', 'token2', ..., 'tokenn'], tags=[document number]). You can query this to find the document whose tag is 1913.
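A sketch of the database idea using the standard-library `sqlite3` module. The table name, function names and the 0-based numbering are assumptions for illustration, and this only helps if the documents are iterated in the same order that was used for training.

```python
import sqlite3


def build_lookup(documents, db_path=":memory:", start=0):
    """Store each document under the integer ID it is assumed to have."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT)"
    )
    with conn:  # commit once after all rows are inserted
        conn.executemany(
            "INSERT OR REPLACE INTO docs (id, text) VALUES (?, ?)",
            enumerate(documents, start=start),
        )
    return conn


def fetch_documents(conn, ids):
    """Return {id: text} for the requested IDs, e.g. those from most_similar()."""
    ids = list(ids)
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, text FROM docs WHERE id IN ({placeholders})", ids
    )
    return dict(rows)
```

With the output from the question, something like `fetch_documents(conn, [doc_id for doc_id, _ in sims])` would return the text for the ten matches. Pass a file path as `db_path` so the table is built once and reused across runs.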

Vera