
I am new to LangChain and I am stuck on an issue. My end goal is to read the contents of a file and create a vectorstore of my data which I can query later.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader


loader = TextLoader("elon_musk.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

It looks like there is some issue with my data file, and because of it the loader is not able to read the contents of my file. Is it possible to load my file as UTF-8? My assumption is that with UTF-8 encoding I should not face this issue.

Following is the error I am getting in my code:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:41, in TextLoader.load(self)
     40     with open(self.file_path, encoding=self.encoding) as f:
---> 41         text = f.read()
     42 except UnicodeDecodeError as e:

File ~\anaconda3\envs\langchain-test\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 8
      4 from langchain.document_loaders import TextLoader
      7 loader = TextLoader("elon_musk.txt")
----> 8 documents = loader.load()
      9 text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
     10 docs = text_splitter.split_documents(documents)

File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:54, in TextLoader.load(self)
     52                 continue
     53     else:
---> 54         raise RuntimeError(f"Error loading {self.file_path}") from e
     55 except Exception as e:
     56     raise RuntimeError(f"Error loading {self.file_path}") from e

RuntimeError: Error loading elon_musk.txt

I'd appreciate any suggestion that could help me get unblocked.

Rito

4 Answers


It does not look like a LangChain issue, just an encoding mismatch in your input file: TextLoader opens the file with your platform's default encoding (cp1252 on Windows, per the traceback), and byte 0x9d in your file has no mapping in that code page.
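
A quick way to confirm this (a sketch; the byte position comes straight from the error message) is to read the file in binary mode and inspect the offending byte:

# Read the raw bytes and look at the byte the traceback points to
with open("elon_musk.txt", "rb") as f:
    data = f.read()

print(hex(data[1897]))  # 0x9d, which cp1252 cannot decode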

Following separation of concerns, I would therefore re-encode the file as valid UTF-8 first and then pass it to LangChain:

# Read the file using the correct encoding
with open("elon_musk.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Write the text back to a new file, ensuring it's in UTF-8 encoding
with open("elon_musk_utf8.txt", "w", encoding="utf-8") as f:
    f.write(text) 

loader = TextLoader("elon_musk_utf8.txt")
documents = loader.load()

[Optional] In case the first read, with UTF-8 encoding, fails (because the input file uses some unexpected, exotic encoding), I would let Python detect the actual encoding of the file and pass that to the open method. To detect the actual encoding, I would use the chardet library this way:

import chardet

def detect_encoding(file_path):
    # Read the raw bytes and let chardet guess the most likely encoding
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

encoding = detect_encoding("elon_musk.txt")

with open("elon_musk.txt", 'r', encoding=encoding) as f:
    text = f.read()

with open("elon_musk_utf8.txt", 'w', encoding='utf-8') as f:
    f.write(text)

loader = TextLoader("elon_musk_utf8.txt")
documents = loader.load()
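
Note that chardet.detect also returns a confidence score alongside the encoding, and result['encoding'] can be None when detection fails entirely, so it is worth checking the result before re-reading the file.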
Marc

Try the DirectoryLoader instead; that worked for me.
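
For example (a minimal sketch, assuming your .txt files live in the current directory; the autodetect_encoding option it passes through to TextLoader is shown in more detail in the last answer below):

from langchain.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in the directory, asking TextLoader to detect
# each file's encoding instead of assuming the platform default
loader = DirectoryLoader(
    ".",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"autodetect_encoding": True},
)
documents = loader.load()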


You can load and split the document using this code (assuming doc already holds the raw bytes of your file):

from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Decode the raw bytes as UTF-8 and write them back out as a UTF-8 text file
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open('test.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Measure chunk length in GPT-2 tokens rather than characters
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=64,
    chunk_overlap=24,
    length_function=count_tokens,
)

chunks = text_splitter.create_documents([text])
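
Each resulting chunk is then at most 64 GPT-2 tokens long, with 24 tokens of overlap between consecutive chunks.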
Hamza Dh

I just had this same problem. The code worked fine in Colab (Unix), but not in VS Code. I tried Marc's suggestions to no avail, checked that the VS Code preference for encoding was UTF-8, verified that the files were exactly the same on both machines, and even ensured they had the same Python version!

Here is what worked for me. When using TextLoader, do it like this:

loader = TextLoader("elon_musk.txt", encoding='utf-8')

When using DirectoryLoader, instead of this:

loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader)

Do This:

text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
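
With autodetect_encoding=True, TextLoader falls back to detecting the file's encoding when the initial read fails, instead of raising the RuntimeError shown in the question.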