From here:
We just faced the same issue for the first time when using the
openai-python package. We ran some tests, and around 11% of the
embeddings came back with noticeably different coordinates, even
though they remained close in the vector space.
UPDATE: For anyone facing this issue, the embeddings endpoint is
deterministic. The difference is caused by the OpenAI Python package,
which uses base64 as its default encoding format, while other clients
don't.
If you dive into the library code:
```python
class Embedding(EngineAPIResource):
    OBJECT_NAME = "embeddings"

    @classmethod
    def create(cls, *args, **kwargs):
        start = time.time()
        timeout = kwargs.pop("timeout", None)
        user_provided_encoding_format = kwargs.get("encoding_format", None)

        # If encoding format was not explicitly specified, we opaquely use base64 for performance
        if not user_provided_encoding_format:
            kwargs["encoding_format"] = "base64"
```
From this GitHub repo:
Displaying coordinates of text embeddings retrieved using the OpenAI
Python library shows more digits than when the embeddings are
retrieved explicitly from the API endpoint or using most other
libraries. This repository explores why that is, how to get this
behavior (and by the same mechanism) when working in other languages,
and why one should not usually bother to do so.
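The mechanism itself is just Base64-decoding raw little-endian float32 bytes. A minimal sketch in Python (the sample string is hypothetical; it encodes [1.0, 2.0, 3.0]):

```python
import base64

import numpy as np

# `b64_embedding` stands in for the "embedding" field of one item in the
# response's "data" list when encoding_format="base64" is requested.
b64_embedding = "AACAPwAAAEAAAEBA"  # hypothetical: encodes [1.0, 2.0, 3.0]

raw = base64.b64decode(b64_embedding)
vector = np.frombuffer(raw, dtype="<f4")  # little-endian float32
print(vector)  # [1. 2. 3.]
```

Converting those float32 values to Python floats (doubles) is what produces the extra digits: they are an artifact of the widening conversion, not extra precision from the model.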
More specifically, this repository is a collection of code examples
and documentation for the encoding_format argument to the OpenAI
embeddings API, which, when set to base64, makes the API return the
raw float32 values encoded in Base64. The OpenAI Python library uses
this under the hood.
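To reproduce the library's behavior without it, you can call the endpoint directly and decode the Base64 payload yourself. A hedged sketch using requests (the model name is just an example):

```python
import base64
import os

import numpy as np
import requests

# Request Base64-encoded embeddings straight from the HTTP endpoint,
# the same way the Python library does under the hood.
resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "text-embedding-ada-002",  # example model
        "input": "hello world",
        "encoding_format": "base64",  # raw float32 bytes, Base64-encoded
    },
)
b64 = resp.json()["data"][0]["embedding"]  # a Base64 string, not a list
vector = np.frombuffer(base64.b64decode(b64), dtype="<f4")  # decode as above
print(len(vector), vector[:5])
```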