
I've trained a Doc2Vec model and I'm trying to get predictions from it.

I use

test_data = word_tokenize("Филип Моррис Продактс С.А.".lower())
model = Doc2Vec.load(model_path)
v1 = model.infer_vector(test_data)
sims = model.docvecs.most_similar([v1])
print(sims)

returns

[('624319', 0.7534812092781067), ('566511', 0.7333904504776001), ('517382', 0.7264763116836548), ('523368', 0.7254455089569092), ('494248', 0.7212602496147156), ('382920', 0.7092794179916382), ('530910', 0.7086726427078247), ('513421', 0.6893941760063171), ('196931', 0.6776881814002991), ('196947', 0.6705600023269653)]

Next, I tried to find out which text this tag corresponds to:

model.docvecs['624319']

But it returns only the vector representation:

array([ 0.36298314, -0.8048847 , -1.4890883 , -0.3737898 , -0.00292279,
   -0.6606688 , -0.12611026, -0.14547637,  0.78830665,  0.6172428 ,
   -0.04928801,  0.36754376, -0.54034036,  0.04631123,  0.24066721,
    0.22503968,  0.02870891,  0.28329515,  0.05591608,  0.00457001],
  dtype=float32)

So, is there any way to get the text for this tag from the model? Loading the training dataset takes a lot of time, so I'm looking for another way.

Petr Petrov

1 Answer


There is no way to convert a doc vector directly back into the original text (information such as word order is lost when the text is reduced to a vector).

However, you can retrieve the original text by tagging each document with its index in your corpus list when you are creating your TaggedDocuments for Doc2Vec(). Let's say you had a corpus of sentences/documents that are contained in a list called texts. Use enumerate() like this to generate a unique index i for each sentence, and pass that as the tags argument for TaggedDocument:

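from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Tag each document with its position in `texts` (as a string)
tagged_data = []
for i, t in enumerate(texts):
    tagged_data.append(TaggedDocument(words=word_tokenize(t.lower()), tags=[str(i)]))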

model = Doc2Vec(vector_size=VEC_SIZE,
                window=WINDOW_SIZE,
                min_count=MIN_COUNT,
                workers=NUM_WORKERS)

model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

Then after training, when you get the results from model.docvecs.most_similar(), the first number in each tuple will be the index into your original list of corpus texts. So for example, if you run model.docvecs.most_similar([some_vector]) and get:

[('624319', 0.7534812092781067), ('566511', 0.7333904504776001), ('517382', 0.7264763116836548), ('523368', 0.7254455089569092), ('494248', 0.7212602496147156), ('382920', 0.7092794179916382), ('530910', 0.7086726427078247), ('513421', 0.6893941760063171), ('196931', 0.6776881814002991), ('196947', 0.6705600023269653)]

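... then you could retrieve the original document for the first result, ('624319', 0.7534812092781067), by indexing into your initial corpus list with texts[624319]. Note that the tag comes back as a string, so convert it with int() before using it as a list index.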

Or if you wanted to loop through and get all of the most similar texts, you could do something like:

most_similar_docs = []
for tag, score in model.docvecs.most_similar([some_vector]):
    # tags were stored as strings, so convert back to an int list index
    most_similar_docs.append(texts[int(tag)])
J. Taylor
  • The downside of this simple method is that it requires keeping all of the texts and the Doc2Vec model in memory. However, if the corpus is too large for memory, you can store it in a database, keyed by the same indices. You could then retrieve the entries with an id2doc() function that queries the DB. – J. Taylor Feb 17 '19 at 22:18
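The database idea from the comment above could look roughly like the sketch below. It uses SQLite from the standard library as one possible store. The table name docs, the helpers build_doc_store() and docs_for_results(), and the body of id2doc() are all assumptions made for illustration; none of them come from gensim or from the original answer. The only requirement is that the keys match the string tags passed to TaggedDocument.

```python
import sqlite3


def build_doc_store(texts, db_path=":memory:"):
    """Store each text under the same string tag used for TaggedDocument.

    Hypothetical helper: the table layout is an assumption for illustration.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (tag TEXT PRIMARY KEY, body TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO docs (tag, body) VALUES (?, ?)",
        ((str(i), t) for i, t in enumerate(texts)),
    )
    conn.commit()
    return conn


def id2doc(conn, tag):
    """Return the stored text for a tag, or None if the tag is unknown."""
    row = conn.execute(
        "SELECT body FROM docs WHERE tag = ?", (str(tag),)
    ).fetchone()
    return row[0] if row else None


def docs_for_results(conn, sims):
    """Map (tag, score) pairs, shaped like most_similar() output, to (text, score)."""
    return [(id2doc(conn, tag), score) for tag, score in sims]


if __name__ == "__main__":
    conn = build_doc_store(["first document", "second document", "third document"])
    # Stand-in for model.docvecs.most_similar([some_vector])
    sims = [("2", 0.75), ("0", 0.73)]
    print(docs_for_results(conn, sims))
```

Passing a file path instead of ":memory:" keeps the store on disk, so at query time only the model and the database file need to be opened and the training dataset does not have to be reloaded.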