I am trying to build a search application for resumes stored as .pdf files. For a search query like "who is proficient in Java and worked in an MNC", the output should be the CV that best matches the query. My plan is to extract the text from each PDF and compute the cosine similarity between the document text and the query.
However, BERT has a problem with long documents: it supports a maximum sequence length of only 512 tokens, while all of my CVs run to more than 1,000 words. I am really stuck here. Methods like truncating the documents don't suit the purpose, since the relevant skills may appear anywhere in a CV.
Is there any other model that can handle this? I looked into models like Longformer and XLNet, but I could not figure out the right approach for this task with them.
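One workaround I have been considering is chunk-and-pool: split each long CV into pieces that fit the encoder, embed every piece, and average the piece vectors into a single document vector. Below is a minimal sketch of that idea using the Universal Sentence Encoder I already have; the 100-word chunk size and the mean pooling are my own assumptions, and with a BERT-style model the splitting would have to be done in tokens rather than words.

import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed_long_text(text, chunk_words=100):
    # Split the document into fixed-size word chunks (the size is arbitrary).
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Embed all chunks in one batch and mean-pool into one document vector.
    chunk_vecs = model(chunks).numpy()
    return chunk_vecs.mean(axis=0)

I am not sure whether mean pooling loses too much signal when the query only matches one section of a CV, which is part of what I am asking about.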
Here is my current attempt using the Universal Sentence Encoder:

import numpy as np
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def cosine(u, v):
    # Cosine similarity between two 1-D vectors.
    u, v = np.asarray(u), np.asarray(v)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# `documents` is a dict of {file_name: extracted_pdf_text} built elsewhere.
corpus = list(documents.values())
doc_names = list(documents.keys())
sentence_embeddings = model(corpus)  # embed all documents in one batch

query = "who is proficient in C++ and has Rust"
query_vec = model([query.lower()])[0]

results = []
for i, _ in enumerate(corpus):
    # Reuse the precomputed embedding instead of re-encoding each document.
    sim = cosine(query_vec, sentence_embeddings[i])
    results.append((i, sim))
    # print("Document =", doc_names[i], "; similarity =", sim)

results = sorted(results, key=lambda x: x[1], reverse=True)
print(results)

for idx, score in results[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % score)
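For a larger corpus, I assume the per-document loop above could be replaced with a single matrix operation. A sketch that continues from the variables above (sentence_embeddings, query_vec, doc_names) and assumes they convert cleanly to NumPy:

import numpy as np

doc_matrix = np.asarray(sentence_embeddings)   # shape (n_docs, dim)
q = np.asarray(query_vec)                      # shape (dim,)
# Cosine similarity of the query against every document at once.
sims = doc_matrix @ q / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q))
for idx in np.argsort(-sims)[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % sims[idx])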