
Given a doc2vec vector generated for some document, is it possible to reverse the vector back into the original document? If so, is there any hash algorithm that would make the vector irreversible but still comparable to other vectors of the same type (using cosine/Euclidean distance)?

1 Answer


It's unclear why you've mentioned "TF-IDF vector" in your question title, but then asked about a Doc2Vec vector – which is very different from a TF-IDF approach. I'll assume your main interest is Doc2Vec vectors.

In general, a Doc2Vec vector has far too little information to actually reconstruct the document for which the vector was calculated. It's essentially a compressed summary, based on evolving (in training or inference) a vector that's good (within the limits of the model) at predicting the document's words.

For example, one commonly-used dimensionality for Doc2Vec vectors is 300. Those 300 dimensions are each represented by a 4-byte floating-point value. So the vector is 1200 bytes in total, but it could be the summary vector for a document of many hundreds or thousands of words, whose text is far larger than 1200 bytes.
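As a rough illustration of that size argument, here is a small sketch of the arithmetic. The document length (2,000 words) and the average of 6 bytes per word are made-up numbers for the example, not measurements of any real corpus.

```python
import struct

DIMENSIONS = 300                           # a commonly used Doc2Vec vector size
BYTES_PER_FLOAT32 = struct.calcsize("<f")  # size of a 32-bit float: 4 bytes

vector_bytes = DIMENSIONS * BYTES_PER_FLOAT32

# Hypothetical document: 2,000 words averaging 6 bytes each
# (roughly 5 letters plus a space). These figures are assumptions.
doc_words = 2000
avg_bytes_per_word = 6
document_bytes = doc_words * avg_bytes_per_word

print(f"vector:   {vector_bytes} bytes")
print(f"document: {document_bytes} bytes")
print(f"ratio:    {document_bytes / vector_bytes:.1f}x")
```

Under those assumptions the raw text is ten times the size of the vector, and longer documents widen the gap further, since the vector size stays fixed.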

It's theoretically plausible that with a Doc2Vec vector, and the associated model from which it was trained or inferred, you could generate a ranked list of the words most likely to be in the document. There's a pending feature request to offer this in Gensim (#2459), but no code implementing it yet. Such a list of words wouldn't be grammatical, though, and even the top 10 words in the list might not be in the document at all. (The list might be made up entirely of other, similar words.)
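To make the idea of a ranked word list concrete, here is a minimal sketch that is not Gensim code and not the proposed feature. It assumes a model whose output layer scores each vocabulary word by a dot product with the document vector, followed by a softmax. The function `rank_likely_words`, the four-word vocabulary, and all the weights are hypothetical toy values.

```python
import math

# Hypothetical output-layer weights, one row per vocabulary word.
# A real model would have thousands of words and hundreds of dimensions.
output_weights = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "bond":  [0.1, 0.0, 0.8],
}

def rank_likely_words(doc_vector, weights):
    """Return (word, probability) pairs, most probable first."""
    # Score each word by its dot product with the document vector.
    scores = {
        word: sum(a * b for a, b in zip(doc_vector, row))
        for word, row in weights.items()
    }
    # Softmax, shifted by the max score for numerical stability.
    max_score = max(scores.values())
    exps = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exps.values())
    return sorted(
        ((word, e / total) for word, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

doc_vector = [1.0, 0.2, 0.0]  # toy document vector
ranking = rank_likely_words(doc_vector, output_weights)
for word, prob in ranking:
    print(f"{word}: {prob:.3f}")
```

Note that the ranking only says which words the model finds compatible with the vector. Nothing in it records whether a word actually occurred in the text, or in what order.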

With a large set of calculated vectors, as you get when training of a model has finished, you could take a vector (from that set, or from inferring a new text), and look through the set-of-vectors for whichever one has a vector closest to your query vector. That would point you at one of your known documents - but that's more of a lookup (when you already know many example documents) than reversing a vector into a document directly.
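That lookup can be sketched in a few lines of plain Python. The document IDs, the 3-dimensional vectors, and the helper names `cosine_similarity` and `closest_document` are all hypothetical, chosen only to show the shape of the search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_document(query, known_vectors):
    """Return the ID of the known document whose vector is most similar to query."""
    return max(
        known_vectors,
        key=lambda doc_id: cosine_similarity(query, known_vectors[doc_id]),
    )

# Hypothetical vectors for documents whose text we already have.
known_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.1],
    "doc_c": [0.1, 0.0, 0.9],
}

query = [0.8, 0.2, 0.1]  # a vector whose source text we want to guess
best = closest_document(query, known_vectors)
print(best)
```

The search only ever returns a document that is already in `known_vectors`, which is why it amounts to a lookup and not a reconstruction.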

You'd have to say more about your need for an 'irreversible' vector that is still good for document-to-document comparisons for me to make further suggestions to meet that need.

To an extent, Doc2Vec vectors already meet that need, as they can't regenerate an exact document. But given that they could generate a list of likely words (per above), if your needs are more rigorous, you might need extra steps. For example, if you used a model to calculate all needed vectors, but then threw away the model, even that theoretical capability to list most-likely words would go away.

But to the extent you still have the vectors, and potentially their mappings to full documents, a vector still implies one, or a few, closest documents from the known set. And even if you somehow had a novel vector, without its text, simply looking at the closest of your known documents would be highly suggestive (but not dispositive) about what words are in the source document.

(If your needs are very demanding, there might be something in the genre of 'Fully Homomorphic Encryption' and/or 'Private Information Retrieval' that would help. Those use advanced cryptography to allow queries on encrypted data that only reveal final results, hiding the details of what you're doing even from the system answering your query. But those techniques are much newer and more complicated, with few if any sources of ready-to-use code, and adapting them specifically for vector-similarity calculations might require significant custom cryptography work.)

gojomo