
I wrote the following code in a Jupyter Notebook with langchain==0.0.134 (which in my case comes with openai==0.27.2). The code takes a CSV file and loads it into Chroma using OpenAI embeddings.

CSV

COLUMN1;COLUMN2
Hello;World
From;CSV

Jupyter Notebook

#!/usr/bin/env python
# coding: utf-8
get_ipython().run_line_magic('load_ext', 'dotenv')
get_ipython().run_line_magic('dotenv', '')

# ### CSV Load
from langchain.document_loaders.csv_loader import CSVLoader

csv_args = {
    "delimiter": ";",
    "quotechar": '"',
    "fieldnames": ["COLUMN1", "COLUMN2"],
}
loader = CSVLoader(file_path='./data/stack-overflow-test.csv', csv_args=csv_args)

# ### Load in Chroma
from langchain.vectorstores import Chroma
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings.openai import OpenAIEmbeddings

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    vectorstore_kwargs={"collection_name": "collection"}
)

# This is the line of code that is recorded with the "packet analyzer"
indexWrapper = index_creator.from_loaders([loader])
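
A side note on the loading step, as a minimal standard-library sketch. It assumes that CSVLoader passes csv_args straight to csv.DictReader and renders each row as "key: value" lines (both are assumptions about this langchain version, not something confirmed here). When fieldnames is given explicitly, csv.DictReader treats the header line as an ordinary data row, which would be consistent with three inputs, not two, showing up in the captured request. The inline CSV text stands in for ./data/stack-overflow-test.csv.

```python
import csv
import io

# Same content as the CSV shown above, inlined so the sketch is self-contained.
CSV_TEXT = "COLUMN1;COLUMN2\nHello;World\nFrom;CSV\n"

# Same csv_args as in the notebook.
csv_args = {
    "delimiter": ";",
    "quotechar": '"',
    "fieldnames": ["COLUMN1", "COLUMN2"],
}

# With explicit fieldnames, DictReader does not consume the first line as a header.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT), **csv_args))

# Assumption: the loader renders each row roughly as "key: value" lines.
documents = ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in rows]

for document in documents:
    print(repr(document))
```

Dropping the fieldnames entry would let DictReader read the column names from the first line instead, leaving two rows.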

If I check the request (using Wireshark), I obtain the following:

Request

POST /v1/engines/text-embedding-ada-002/embeddings HTTP/1.1
Host: api.openai.com
User-Agent: OpenAI/v1 PythonBindings/0.27.2
Content-Type: application/json

{
  "input": [
    [82290, 16, 25, 76880, 82290, 16, 40123, 17, 25, 40123, 17],
    [82290, 16, 25, 22691, 40123, 17, 25, 4435],
    [82290, 16, 25, 5659, 40123, 17, 25, 28545]
  ],
  "encoding_format": "base64"
}

Reply

openai-version: 2020-10-01
Content-Type: application/json
{
    "object": "list",
    "data": [
      {
        "object": "embedding",
        "index": 0,
        "embedding": ""
      },
      {
        "object": "embedding",
        "index": 1,
        "embedding": ""
      },
      {
        "object": "embedding",
        "index": 2,
        "embedding": ""
      }
    ],
    "model": "text-embedding-ada-002-v2",
    "usage": {
      "prompt_tokens": 27,
      "total_tokens": 27
    }
  }
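
One detail can be checked from the capture alone: the number of integers in the request's input field can be compared with the prompt_tokens value in the reply. The sketch below only copies values from the request and reply above; it says nothing about which tokenizer produced them.

```python
# Token-ID rows copied from the captured request body.
request_input = [
    [82290, 16, 25, 76880, 82290, 16, 40123, 17, 25, 40123, 17],
    [82290, 16, 25, 22691, 40123, 17, 25, 4435],
    [82290, 16, 25, 5659, 40123, 17, 25, 28545],
]

# "prompt_tokens" value copied from the captured reply.
reported_prompt_tokens = 27

row_lengths = [len(row) for row in request_input]
total_ids = sum(row_lengths)

print(row_lengths)                          # [11, 8, 8]
print(total_ids == reported_prompt_tokens)  # True
```

The counts match, which fits the reading that each integer is one token ID.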

I want to understand what is happening behind the scenes, because this kind of inspection is useful for troubleshooting. As you can see, the payload's input field contains a matrix of numbers, which does not make sense to me and does not match the documentation.

So I have two questions:

  1. Why does the input field have this matrix of numbers?
  2. How can I decode the reply? Base64-decoding the embedding field does not give me the vector I am supposed to receive.
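
For the second question, here is a minimal sketch of one plausible decoding. It assumes the embedding field is the Base64 encoding of a packed array of little-endian 32-bit floats; that layout is an assumption and is not confirmed by the capture, where the embedding strings are shown empty. The helper name decode_embedding and the sample values are made up for illustration.

```python
import base64
import struct
from typing import List


def decode_embedding(b64_text: str) -> List[float]:
    """Decode a Base64 string, assuming it holds packed little-endian float32 values."""
    raw = base64.b64decode(b64_text)
    if len(raw) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4, so this is not float32 data")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with made-up values (not real API output), chosen to be exact in float32.
sample = [0.5, -1.25, 2.0]
encoded = base64.b64encode(struct.pack("<3f", *sample)).decode("ascii")
print(decode_embedding(encoded))  # [0.5, -1.25, 2.0]
```

Decoding the Base64 text as a string (for example as UTF-8) would not work under this assumption, because the bytes are binary floats, not characters.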

It looks like OpenAI's Python client uses an older version of the API. Could that be the reason? (I have not used the API before.)

ChatGPT mentioned:

The tokens are represented by numerical IDs such as 82290, 16, 25, etc., which likely correspond to a vocabulary or tokenization scheme used by OpenAI

However, it does not provide references, and I would like to have them. It might be related to one of these tools: tiktoken or the Hugging Face tokenizers.
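
A way to test the tiktoken hypothesis is to decode the captured IDs back into text. The sketch below assumes the third-party tiktoken package is installed and that the cl100k_base encoding is the right one for text-embedding-ada-002; both are assumptions to verify. The helper name decode_rows is hypothetical.

```python
from typing import List

# Token-ID rows copied from the captured request body.
TOKEN_ROWS = [
    [82290, 16, 25, 76880, 82290, 16, 40123, 17, 25, 40123, 17],
    [82290, 16, 25, 22691, 40123, 17, 25, 4435],
    [82290, 16, 25, 5659, 40123, 17, 25, 28545],
]


def decode_rows(rows: List[List[int]], encoding_name: str = "cl100k_base") -> List[str]:
    """Turn each row of token IDs back into text with tiktoken."""
    # Imported lazily: tiktoken is a third-party package (pip install tiktoken)
    # and may download the encoding file on first use.
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return [encoding.decode(row) for row in rows]
```

Calling print(decode_rows(TOKEN_ROWS)) should show whether the IDs map back to the CSV rows. If they do, that would suggest the text is tokenized on the client side before the request is sent.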

Edu