I'm trying to run llama_index, but the index-building step fails every time. How can I fix this?

Question

I'm trying to use llama_index, which builds an index from your own documents and lets you ask GPT questions about the information in them.

This is the full code (with my own API key substituted, of course):

import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

When I run the index build according to the steps in their documentation, it fails at this step:

index = GPTSimpleVectorIndex.from_documents(documents)

with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\indices\base.py", line 92, in from_documents
    service_context = service_context or ServiceContext.from_defaults()
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\indices\service_context.py", line 71, in from_defaults
    embed_model = embed_model or OpenAIEmbedding()
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\embeddings\openai.py", line 209, in __init__
    super().__init__(**kwargs)
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\embeddings\base.py", line 55, in __init__
    self._tokenizer: Callable = globals_helper.tokenizer
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\utils.py", line 50, in tokenizer
    enc = tiktoken.get_encoding("gpt2")
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\tiktoken\registry.py", line 63, in get_encoding
    enc = Encoding(**constructor())
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\tiktoken_ext\openai_public.py", line 11, in gpt2
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\tiktoken\load.py", line 83, in data_gym_to_mergeable_bpe_ranks
    for first, second in bpe_merges:
ValueError: not enough values to unpack (expected 2, got 1)
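Reading the traceback from the bottom up, the exception is raised inside tiktoken while llama_index builds its default service context, and the last line is Python's generic tuple-unpacking error. The sketch below reproduces the same message with made-up data. The function `parse_merges` and the sample lines are hypothetical and are not tiktoken's code or vocabulary; the idea that a line in the loaded merges data did not split into two parts is an assumption, not a diagnosis.

```python
# Hypothetical helper and data, NOT taken from tiktoken.
def parse_merges(lines):
    """Split each line into a (first, second) pair, like a table of merge rules."""
    pairs = [tuple(line.split()) for line in lines]
    # Unpacking fails as soon as one line yields fewer than two parts.
    return [(first, second) for first, second in pairs]


good_lines = ["a b", "c d"]
bad_lines = ["a b", "<html>"]  # a line that is not a two-part rule

print(parse_merges(good_lines))

try:
    parse_merges(bad_lines)
except ValueError as exc:
    print(exc)  # same wording as the last line of the traceback
```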

I should mention that I ran this on a folder that contains DOCX files and subfolders, and the subfolders also contain DOCX files.

Can you share the code snippet on how you are generating the documents object? — Nischal Hp, Apr 09 '23 at 12:36
The code snippet is exactly the same as in the GitHub documentation. The error also occurs when I pass "documents" as the folder name: `documents = SimpleDirectoryReader('documents').load_data()` — NH.LOCAL, Apr 09 '23 at 16:41
Do you mean Directory Loader? I do not see any class for SimpleDirectoryReader in Langchain. What is the type of documents you are reading and where are they stored? — Nischal Hp, Apr 11 '23 at 06:15
I don't know the library. I executed the commands according to the llama index documentation. The document type is docx. And they are stored in the current directory (cd) where I ran the command. — NH.LOCAL, Apr 12 '23 at 16:56
I was surprised to find that the loader does not fail if you give it a wrong path. Is 'documents/' a subdirectory of your current directory, and does it contain your files? You may want to check that the loaded list is not empty. — Tanguy A., Apr 14 '23 at 09:20
No. The directory containing the documents is my current directory. It definitely contains many docx files, although they are in subfolders and not in the main folder. I should note that the full path of the current directory is in Hebrew, but the error also occurred when I changed the current directory to other directories. — NH.LOCAL, Apr 15 '23 at 17:03
You should provide more information, such as what kind of documents they are, show that the directory contains them, the code of your program and everything necessary so that whoever solves your question does not have to spend their time asking. — Edgardo Genini, Apr 18 '23 at 00:56

score 1 · Answer 1 · answered Apr 28 '23 at 18:50

You must set the recursive argument to True if your files are in subfolders:

documents = SimpleDirectoryReader('documents', recursive=True).load_data()
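Because a folder with no matching files can go unnoticed, it may help to check what is actually on disk before handing the folder to the reader. The sketch below uses only the standard library; `find_files` is a hypothetical helper, not part of llama_index, and the `*.docx` pattern is assumed from the file type mentioned in the question.

```python
from pathlib import Path


def find_files(folder, pattern="*.docx", recursive=True):
    """Return matching files under folder, optionally descending into subfolders."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not an existing directory")
    # rglob walks subfolders; glob only looks at the folder itself.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


if __name__ == "__main__":
    top_level = find_files(".", recursive=False)
    everything = find_files(".", recursive=True)
    print(f"{len(top_level)} file(s) in the folder itself, "
          f"{len(everything)} including subfolders")
```

If the first count is zero and the second is not, the files are only in subfolders, which is the case where recursive=True matters.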

Max


Your explanation does seem to answer the question. But now there is another error when running – NH.LOCAL May 04 '23 at 15:22
What is the error message you are receiving? – Max May 05 '23 at 17:49
The error seems to be due to OpenAI's restrictions on API usage. This is the error: `openai.error.RateLimitError: You exceeded your current quota, please check your plan and billing details.` – NH.LOCAL Aug 12 '23 at 19:25

score 0 · Accepted Answer · answered Aug 12 '23 at 19:33

It turned out that I had misunderstood how the code is meant to be used.

The value 'data' is not a fixed parameter of the function. It is only an example of a folder name, standing in for the folder that contains your own files.

You can pass a relative path, like:

documents = SimpleDirectoryReader('my_folder').load_data()

or an absolute path, such as (note the r prefix, which keeps Python from treating the backslashes as escape sequences):

documents = SimpleDirectoryReader(r'c:\users\user\my_files').load_data()

With the path pointing at the folder that actually holds your files, the code works as expected.
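One thing to watch with absolute Windows paths: in Python 3, a plain string literal that contains a backslash followed by u (as in a path starting with c: and then a users folder) is rejected with a SyntaxError, because that sequence starts a Unicode escape. The sketch below shows three spellings that avoid the problem. The folder name is the hypothetical one from this answer, and `same_windows_path` is an illustrative helper, not a llama_index function.

```python
from pathlib import PureWindowsPath

# Hypothetical folder, matching the example path in this answer.
raw_literal = r'c:\users\user\my_files'        # raw string: backslashes kept as-is
escaped_literal = 'c:\\users\\user\\my_files'  # each backslash doubled
forward_slashes = 'c:/users/user/my_files'     # forward slashes, also valid on Windows


def same_windows_path(a, b):
    """Compare two strings using Windows path rules, on any operating system."""
    return PureWindowsPath(a) == PureWindowsPath(b)


print(raw_literal == escaped_literal)
print(same_windows_path(raw_literal, forward_slashes))
```

Any of the three strings can then be passed as the folder argument.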
