I am trying to get started with LangChain. I am working in the Anaconda/Spyder IDE:

# Imports
import os 
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
import streamlit as st
from streamlit_chat import message



# Set API keys and the models to use
API_KEY = "MY API KEY HERE"
model_id = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = API_KEY

# Raw string so the backslash is never treated as an escape character
pdf_path = r'.\Paris.pdf'
loaders = PyPDFLoader(pdf_path)

I then run it with:

streamlit run c:\users\myname\.spyder-py3\untitled0.py [ARGUMENTS]

The Streamlit app starts and opens in the browser, but it shows this error:

ValueError: File path .\Paris.pdf is not a valid file or url


I have checked carefully, and the PDF is in fact located in the correct directory (i.e. the directory where the Python script is located).

As a test I also tried:

# Imports
from PyPDF2 import PdfReader

pdf_path = './Paris.pdf'

with open(pdf_path, 'rb') as file:
    pdf = PdfReader(file)
    num_pages = len(pdf.pages)

    for page_number in range(num_pages):
        page = pdf.pages[page_number]
        page_text = page.extract_text()
        print(f"Page {page_number + 1}:\n{page_text}")

This worked perfectly. Note that I used the same path as in the LangChain/Streamlit version. I have installed langchain (multiple times), pypdf and streamlit.

I then tried:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(r".\Paris.pdf")
pages = loader.load_and_split()
print(pages)

That works. What is wrong in the first code snippet that causes the file path to throw an exception?

I investigated further, and it turns out that adding the Streamlit components to the code causes the file path issue to occur.

NeilS
  • What error do you get if you pass just the file name, like `PyPDFLoader("Paris.pdf")`? – simpleApp Jul 11 '23 at 19:47
  • Here's something that might assist you: consider exploring this implementation using LangChain - you can find it at [PrivateDocBot](https://github.com/Abhi5h3k/PrivateDocBot) – Abhi Aug 27 '23 at 15:16

1 Answer

Since the error only appears once the Streamlit components are added, I would suggest using Streamlit's file_uploader method instead of a hard-coded path:

import streamlit as st

uploaded_file = st.file_uploader("Upload your PDF")

In that case, however, you have to read the PDF file another way, for example with PyPDF2.PdfReader:

import streamlit as st
from PyPDF2 import PdfReader

uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    reader = PdfReader(uploaded_file)
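
If you would rather keep using PyPDFLoader, which expects a file path rather than an uploaded file object, one possible workaround is to write the uploaded bytes to a temporary file first. The following is only a minimal sketch: the helper name save_bytes_to_temp_pdf is hypothetical, and the Streamlit/LangChain calls are shown as comments because they assume the same imports as in the question.

```python
import tempfile


def save_bytes_to_temp_pdf(data: bytes) -> str:
    """Write raw PDF bytes to a temporary .pdf file and return its path.

    Hypothetical helper for illustration; the caller is responsible for
    deleting the file afterwards (e.g. with os.remove).
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so that another library can open it by path.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
    return tmp.name


# Assumed usage inside a Streamlit script:
#   uploaded_file = st.file_uploader("Upload your PDF")
#   if uploaded_file is not None:
#       tmp_path = save_bytes_to_temp_pdf(uploaded_file.getvalue())
#       pages = PyPDFLoader(tmp_path).load_and_split()
```

The temporary file is kept on disk on purpose so that the loader can reopen it by path; remove it once the pages have been loaded.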

If you need the uploaded PDF as a list of Document objects (the format you get when the file is loaded through langchain.document_loaders.PyPDFLoader), you can build them yourself as follows:

import streamlit as st
from PyPDF2 import PdfReader
from langchain.docstore.document import Document

uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    docs = []
    reader = PdfReader(uploaded_file)
    # Number pages from 1 and store the number in each Document's metadata
    for i, page in enumerate(reader.pages, start=1):
        docs.append(Document(page_content=page.extract_text(), metadata={'page': i}))
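
If you still want to read the PDF from disk, it may also be worth checking the working directory. A relative path such as .\Paris.pdf is resolved against the current working directory of the process, not against the folder that contains the script, so it can fail when streamlit run is started from a different folder. Whether that is the cause here is an assumption on my part, since the question does not say where the command was run from. A minimal sketch that builds the path relative to the script instead (resolve_next_to is a hypothetical helper name):

```python
from pathlib import Path


def resolve_next_to(script_file: str, filename: str) -> Path:
    """Return the absolute path of `filename` in the same folder as `script_file`."""
    # resolve() makes the script path absolute; parent is its containing folder.
    return Path(script_file).resolve().parent / filename


# Assumed usage inside the Streamlit script from the question:
#   pdf_path = resolve_next_to(__file__, "Paris.pdf")
#   loader = PyPDFLoader(str(pdf_path))
```

Printing os.getcwd() at the top of the script is a quick way to confirm or rule out this explanation.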