
I'm looking to recreate the BERT synonyms assessment found in this paper (https://www.mdpi.com/2673-6470/2/4/30), in which, from what I gather, the following steps need to be taken:

  1. Pairwise similarities between all the words in BERT’s vocabulary are computed (see the note just after this list about which embeddings I extract).
  2. Using the KDTree algorithm from Python's scikit-learn library, a search index is built upon the matrix computed in step 1 to allow for fast querying.
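
One thing I'm unsure about is whether "all the words in BERT's vocabulary" means contextual embeddings taken from the model outputs (which is what I do below) or simply the rows of the model's static input-embedding matrix. For comparison, here is a minimal self-contained sketch of reading the static matrix directly; this is just an assumption on my part and may not be what the paper intends:

import numpy as np
from transformers import BertModel

# Sketch: use the static (context-free) token embedding table instead of running
# every vocabulary token through the model. Not necessarily what the paper does.
model = BertModel.from_pretrained('deepset/bert-base-cased-squad2')
embedding_matrix = model.get_input_embeddings().weight.detach().numpy()
print(embedding_matrix.shape)  # (vocab_size, hidden_size), e.g. (28996, 768) for bert-base-cased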

I have created the code below to do this. It does return a list of words, but I'm not convinced it is working properly, as the words can be a bit random and I would have expected closer associations. I'm interested to see if anyone has done something similar or can spot obvious errors in the code that would explain why the results are not what I intended (after the code I've also sketched a normalised variant, in case the distance metric is part of the issue).

import numpy as np
import torch
from sklearn.neighbors import KDTree
from transformers import BertTokenizer, BertModel


# Load the BERT-like model and tokenizer - check correct model
model_name = 'deepset/bert-base-cased-squad2'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()

# Get the vocabulary and its size
vocab = tokenizer.get_vocab()
vocab_size = len(vocab)

# Compute embeddings for each word in the vocabulary
word_embeddings = []
for word in vocab:
    input_ids = tokenizer.encode(word, add_special_tokens=False)
    input_ids = torch.tensor(input_ids).unsqueeze(0)
    with torch.no_grad():
        outputs = model(input_ids)
        word_embedding = outputs.last_hidden_state.mean(dim=1).numpy()
    word_embeddings.append(word_embedding)
word_embeddings = np.vstack(word_embeddings)

# Compute pairwise similarities between word embeddings
pairwise_similarities = np.dot(word_embeddings, word_embeddings.T)

# Build KDTree for fast querying
kdtree = KDTree(word_embeddings)

# Function to find similar words
def find_similar_words(query_word, num_neighbors=10):
    query_word = query_word.lower()
    query_embedding = None
    if query_word in vocab:
        query_embedding = word_embeddings[vocab[query_word]]

    if query_embedding is None:
        return []

    _, indices = kdtree.query(query_embedding.reshape(1, -1), k=num_neighbors+1)

    similar_words = [tokenizer.decode([vocab_id]) for vocab_id in indices[0]]
    return similar_words[1:]  # Exclude the query word itself

# Example usage
query_word = "debris"
similar_words = find_similar_words(query_word)
print(f"Words similar to '{query_word}': {similar_words}")
  • Welcome to Stack Overflow! Asking for recommendations might not be appropriate on Stack Overflow (https://stackoverflow.com/help/how-to-ask), but it might be possible to ask the question on https://softwarerecs.stackexchange.com. Also consider logging it on https://stackoverflow.com/collectives/nlp/beta/discussions/76949597 – alvas Aug 25 '23 at 16:26

0 Answers