
Let's say I have many documents, each consisting of a question and an answer. I want to build an embedding index where I can find the most similar documents given just a new question (with no answer), but also find similar documents given a whole document, i.e. question plus answer.

What would be the best way to do this with only one embedding?

I thought of some possible approaches, but each of them would require two different embeddings:

  1. Split all documents into questions and answers and build two separate embeddings: one question embedding and one answer embedding. To find the most similar doc for a new question, I just use the question embedding. To find the most similar doc for a whole new doc, I split the new doc, find the most similar vectors in both embeddings, and combine the results, e.g. as something like average(question_vec, answer_vec).

  2. I create a question-only embedding and a whole-doc embedding, and pick whichever one fits the task.
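Approach 1 could be sketched roughly as below. The `embed()` function here is only a toy stand-in (character-trigram hashing) for a real sentence-embedding model, and the whole-doc query averages the two similarity scores, which for unit-normalized vectors is equivalent to averaging the vectors themselves:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in embedding: hash character trigrams into a fixed-size,
    # unit-normalized vector. Replace with a real embedding model in practice.
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Index: separate question and answer vectors per document (approach 1).
docs = [
    {"q": "How do I reverse a list in Python?",
     "a": "Use list.reverse() or slicing with [::-1]."},
    {"q": "How do I read a file line by line?",
     "a": "Iterate over the open file object."},
]
q_vecs = np.stack([embed(d["q"]) for d in docs])
a_vecs = np.stack([embed(d["a"]) for d in docs])

def search_by_question(question: str) -> int:
    # Question-only query: use the question index alone.
    sims = q_vecs @ embed(question)
    return int(np.argmax(sims))

def search_by_doc(question: str, answer: str) -> int:
    # Whole-doc query: average the similarities from both indexes,
    # i.e. the average(question_vec, answer_vec) idea above.
    sims = 0.5 * (q_vecs @ embed(question)) + 0.5 * (a_vecs @ embed(answer))
    return int(np.argmax(sims))
```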

1 Answer

If you need to match questions & answers separately, creating separate embeddings for each subsection may be a good strategy – & doesn't preclude also creating an embedding for both combined (or other subsections of a long answer).

But what's best will depend on your corpus & domain-specific goals/challenges – so ultimately you need to try multiple approaches & pick the one that scores best on some repeatable evaluation, driven by your own labeled data about desirable associations.

Often the text-embeddings of questions & answers may not be highly-similar, in that the kinds of words/phrasing used in questions may suggest, without closely mimicking, what would be in a related answer.

So an extra level of indirection might be helpful: not just doing raw embedding-similarity search, but trying to learn: "for this sort of question-embedding, which other answer-embeddings are most useful?" (This could involve learning a separate mapping that's more complex than just finding the most-similar vector.)
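One minimal version of such a learned mapping is a least-squares linear map from question-embedding space into answer-embedding space, fit on paired (question, answer) training vectors; everything below is synthetic data just to illustrate the shape of the idea:

```python
import numpy as np

# Learn a linear map W from question-embedding space to answer-embedding
# space, rather than relying on raw question/answer similarity.
# All data here is synthetic, for illustration only.
rng = np.random.default_rng(0)
dim = 16
Q = rng.normal(size=(200, dim))           # question embeddings (training)
W_true = rng.normal(size=(dim, dim))      # unknown question->answer relation
A = Q @ W_true + 0.01 * rng.normal(size=(200, dim))  # paired answer embeddings

# Least-squares fit: W projects a question vector toward its answer's region.
W, *_ = np.linalg.lstsq(Q, A, rcond=None)

# At query time, project the new question, then search the *answer* index
# with the projected vector instead of the raw question vector.
q_new = rng.normal(size=dim)
projected = q_new @ W   # now comparable to answer embeddings
```

In practice the mapping could equally be a small neural network trained on click/relevance data; the linear fit is just the simplest instance of "learning which answer-embeddings go with which question-embeddings."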

An interesting new idea in question-&-answer embeddings is to leverage Large Language Models (LLMs) like ChatGPT etc to generate initial answer text(s) – and you don't even care whether they're right or wrong, just that they tend to use the right sorts of words for a good answer. You then use those possibly-hallucinated junk pseudo-answers to generate "Hypothetical Document Embeddings", and report, from your set of (better, likely-true) answers, the texts closest to those probes. See the "HyDE" (Hypothetical Document Embeddings) paper for more details:

https://arxiv.org/abs/2212.10496
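The HyDE flow might be sketched like this; both `generate_pseudo_answer()` (a placeholder for a real LLM call) and the trigram-hashing `embed()` are stubs for illustration, not the paper's actual models:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in embedding (character-trigram hashing, unit-normalized);
    # use a real embedding model in practice.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def generate_pseudo_answer(question: str) -> str:
    # Placeholder for an LLM call. The generated text may be wrong; it only
    # needs to *sound* like an answer. Hard-coded here for illustration.
    return "Reverse a list with slicing items[::-1] or the list.reverse() method."

# The real (likely-true) answers we actually want to retrieve from.
answers = [
    "Use list.reverse() for in-place reversal, or slicing with [::-1].",
    "Open the file and iterate over it to read line by line.",
]
answer_index = np.stack([embed(a) for a in answers])

def hyde_search(question: str) -> int:
    # HyDE: embed the hypothetical answer, not the question itself, and
    # retrieve the real answers closest to that probe.
    probe = embed(generate_pseudo_answer(question))
    return int(np.argmax(answer_index @ probe))
```

The key point is that the probe vector lives in "answer space", sidestepping the vocabulary mismatch between questions and answers noted above.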

gojomo