I am extracting the word embeddings corresponding to a list of words from the ChatGPT API. I was wondering if there is a way, similar to Gensim's most_similar method, to extract the n words in the entire model that are most similar to my desired terms.
1 Answer
Yes, if you have a Gensim word-vector model, you can use the .most_similar() method to get a report of the words most similar to a supplied target word or vector. Usage is explained extensively in the Gensim docs.
For example, you can supply a single word:
similars = kv_model.most_similar('apple')
You can also supply a list of words as the named positive parameter, and it will return the words most similar to the average of your positive examples' vectors:
similars = kv_model.most_similar(positive=['apple', 'orange', 'melon'])
You can use the topn parameter to return more or fewer than the default 10 nearest neighbors.
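For example, a minimal end-to-end sketch (assuming you load a pretrained KeyedVectors via gensim.downloader; the model name below is just an illustration, and any word-vector model you already have works the same way):

import gensim.downloader as api

# load a small pretrained GloVe model as a stand-in for whatever word-vector model you have
kv_model = api.load('glove-wiki-gigaword-100')

# 20 nearest neighbors of a single word
similars = kv_model.most_similar('apple', topn=20)

# 20 nearest neighbors of the average of several words' vectors
similars = kv_model.most_similar(positive=['apple', 'orange', 'melon'], topn=20)

print(similars[:5])  # list of (word, cosine-similarity) tuples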

gojomo
- Thanks for your response, but the issue is that I don't think there is a way to access the entire ChatGPT model. So I was only able to extract word-level embeddings for a few words. I was just curious if that functionality is somehow available through the API. – Marj Aug 31 '23 at 16:06
- Aha. Which ChatGPT API are you referring to? (I've heard of an OpenAI API endpoint for getting embeddings of multiword texts, but not specifically for single words - though perhaps giving that same endpoint single words would work alright.) If you were able to extract "word-level embeddings for a few words", what happened with the rest of the words you want embeddings for? (Was there an error? Some other response? Poor-quality responses?) – gojomo Aug 31 '23 at 16:41
- I am using the OpenAI endpoint to get the word-level embeddings from the text-embedding-ada-002 model. I am using their "get_embedding" method in a function like this: `def get_embedding_with_na(word, engine='text-embedding-ada-002'): try: embedding = get_embedding(word, engine=engine) return embedding except KeyError: return "NA"` (reformatted as a readable sketch after these comments). I could extract the word-level embeddings for all the words I was looking for, but I'm wondering if there is a way to get the 20 most similar terms to these words in the entire corpus. – Marj Aug 31 '23 at 17:25
- I see. I've not heard of, nor can I find in the OpenAI docs, any reference to an API that returns nearest neighbors from their full vocabulary – nor even any way to enumerate their entire vocabulary, which may be proprietary. That their docs page on `Embeddings` (https://platform.openai.com/docs/guides/embeddings/how-can-i-retrieve-k-nearest-embedding-vectors-quickly) specifically recommends external vector databases for this task implies they have no such offering. So you'd probably have to enumerate your own vocabulary, collect all the vectors locally, and do your own search (a sketch of this appears after these comments). – gojomo Aug 31 '23 at 18:18
- I would say that starting from a root word, or set of seed words, directly asking the full conversational ChatGPT for "related words", repeatedly, might be an interesting strategy for enumerating neighborhoods it knows, or recursively 'crawling' their vocabulary – albeit at far greater computational/API expense than simpler retrievals or bulk raw cosine-distance calcs (which could be repeated across the words found via prompting). A rough sketch of such a crawl appears after these comments. – gojomo Aug 31 '23 at 18:20
- Further, while I'd expect OpenAI's scale to mean their vectors are based on a lot of good training data & plentiful processing, I read the descriptions of that API as providing a vector for *multiword texts* rather than simple single-word-token vectors. It's not clear or certain that the vector for the 1-word text of a word is, or should be, or would be as useful as, the single-word vector itself. For example, 'stop' alone as a 1-word utterance has strong 'imperative verb' connotations, while the word more generally, across all its usage contexts, has many related but different shades of meaning. – gojomo Aug 31 '23 at 18:24
- This is a very good suggestion, to collect similar words by recursive crawling... Also, as you mentioned, I've started having some concerns about the quality of the word-level embeddings. That's why I decided to check the context through the neighboring words. I'm less familiar with generative models in general, and I'm not sure if they perform similarly to other conventional models such as W2V. I'll probably need to do some more research about this. – Marj Aug 31 '23 at 18:42
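For readability, here is the wrapper function pasted in the comments above, reformatted as a runnable sketch. It assumes the pre-1.0 openai Python package, where the embeddings_utils module provides a get_embedding helper; adjust the import and keyword names if your SDK version differs.

from openai.embeddings_utils import get_embedding  # helper shipped with pre-1.0 versions of the openai package

def get_embedding_with_na(word, engine='text-embedding-ada-002'):
    # Return the embedding vector for `word`, or "NA" if the lookup fails.
    try:
        embedding = get_embedding(word, engine=engine)
        return embedding
    except KeyError:
        return "NA"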
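To follow the "collect all the vectors locally and do your own search" suggestion, one option is to load the embeddings you have already retrieved into a Gensim KeyedVectors, so the same .most_similar() call from the answer works over your own vocabulary. A minimal sketch, assuming Gensim 4.x and a hypothetical word_vectors dict mapping each embedded word to its vector:

import numpy as np
from gensim.models import KeyedVectors

def build_local_index(word_vectors):
    # word_vectors: {word: embedding}, e.g. collected via get_embedding_with_na()
    dim = len(next(iter(word_vectors.values())))
    kv = KeyedVectors(vector_size=dim)
    kv.add_vectors(list(word_vectors.keys()), np.array(list(word_vectors.values())))
    return kv

# kv = build_local_index(word_vectors)
# kv.most_similar('apple', topn=20)  # 20 nearest neighbors within *your* vocabulary only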
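Finally, a very rough sketch of the "recursive crawling" idea: repeatedly ask the chat model for related words and expand outward from a set of seed words. It assumes the pre-1.0 openai ChatCompletion API and a naive comma-separated reply format, so treat it as an illustration of the strategy rather than a robust implementation.

import openai

def crawl_related_words(seed_words, rounds=2, per_word=10):
    # Breadth-first expansion: ask the chat model for related words, then repeat on the new words.
    vocab = set(seed_words)
    frontier = list(seed_words)
    for _ in range(rounds):
        next_frontier = []
        for word in frontier:
            resp = openai.ChatCompletion.create(
                model='gpt-3.5-turbo',
                messages=[{'role': 'user',
                           'content': f'List {per_word} words closely related to "{word}", comma-separated, no other text.'}],
            )
            related = [w.strip().lower() for w in resp['choices'][0]['message']['content'].split(',') if w.strip()]
            next_frontier.extend(w for w in related if w not in vocab)
            vocab.update(related)
        frontier = next_frontier
    return vocab

# crawl_related_words(['apple'], rounds=2)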