Langchain pyPDFLoader

Question

I am currently trying to get started working with Langchain. I am working in Anaconda/Spyder IDE:

# Imports
import os 
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
import streamlit as st
from streamlit_chat import message



# Set API keys and the models to use
API_KEY = "MY API KEY HERE"
model_id = "gpt-3.5-turbo"

os.environ["OPENAI_API_KEY"] = API_KEY

pdf_path = '.\Paris.pdf'
loaders = PyPDFLoader(".\Paris.pdf")

I then run it with:

streamlit run c:\users\myname\.spyder-py3\untitled0.py [ARGUMENTS]

The Streamlit app starts and opens in the browser, but it shows this error:

ValueError: File path .\Paris.pdf is not a valid file or url

I have checked carefully and the PDF is in fact located in the correct directory (i.e. the directory where the python script is located).

As a test I also tried:

# Imports
from PyPDF2 import PdfReader

pdf_path = './Paris.pdf'

with open(pdf_path, 'rb') as file:
    pdf = PdfReader(file)
    num_pages = len(pdf.pages)

    for page_number in range(num_pages):
        page = pdf.pages[page_number]
        page_text = page.extract_text()
        print(f"Page {page_number + 1}:\n{page_text}")

This worked perfectly. Note that I used the same path as with the langchain/streamlit version. I have installed langchain (multiple times), pyPDF and streamlit.

I then tried:

import os

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(".\Paris.pdf")
pages = loader.load_and_split()
print(pages)

That works. What is wrong in the first code snippet that causes the file path to throw an exception?

I investigated further, and it turns out that adding the Streamlit components to the code is what causes the file path issue.
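
One possible explanation, not confirmed in this thread, is that a relative path such as `.\Paris.pdf` is resolved against the current working directory, which under `streamlit run` is the folder the command was started from rather than the script's folder. A minimal sketch for checking this, assuming the PDF sits next to the script (`resolve_beside` is a hypothetical helper name):

```python
from pathlib import Path


def resolve_beside(script_file: str, filename: str) -> Path:
    """Return an absolute path to `filename` in the same folder as `script_file`."""
    return Path(script_file).resolve().parent / filename


# Hypothetical usage at the top of the Streamlit script (not executed here):
#   print("Working directory:", Path.cwd())
#   pdf_path = resolve_beside(__file__, "Paris.pdf")
#   print("Resolved path:", pdf_path, "exists:", pdf_path.is_file())
#   loader = PyPDFLoader(str(pdf_path))
```

If the printed working directory differs from the script's folder, passing the resolved absolute path to `PyPDFLoader` should avoid the dependency on where the command was launched.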

What error do you get when you pass just the file name, like `PyPDFLoader("Paris.pdf")`? — simpleApp, Jul 11 '23 at 19:47
Here's something that might assist you: consider exploring this implementation using LangChain - you can find it at [PrivateDocBot](https://github.com/Abhi5h3k/PrivateDocBot) — Abhi, Aug 27 '23 at 15:16

score 1 · Answer 1 · answered Aug 02 '23 at 05:58

Since the error comes from the Streamlit components, I would suggest using Streamlit's file_uploader method as follows:

import streamlit as st

uploaded_file = st.file_uploader("Upload your PDF")

In this case, however, you will have to read the PDF file another way, using PyPDF2.PdfReader, as follows:

import streamlit as st
from PyPDF2 import PdfReader

uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    reader = PdfReader(uploaded_file)

If you need the uploaded PDF as Document objects (which is what you get when the file is loaded through langchain.document_loaders.PyPDFLoader), you can do the following:

import streamlit as st
from PyPDF2 import PdfReader
from langchain.docstore.document import Document

uploaded_file = st.file_uploader("Upload your PDF")
if uploaded_file is not None:
    docs = []
    reader = PdfReader(uploaded_file)
    # Build one Document per page, numbering pages from 1
    for i, page in enumerate(reader.pages, start=1):
        docs.append(Document(page_content=page.extract_text(), metadata={'page': i}))
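
If you would rather keep using PyPDFLoader itself, one option that is not part of the code above is to write the uploaded bytes to a temporary file and pass that file's path to the loader. This is only a sketch under that assumption: `save_bytes_to_temp_pdf` is a hypothetical helper name, and the Streamlit and LangChain calls appear only in comments so that the block runs on the standard library alone.

```python
import os
import tempfile


def save_bytes_to_temp_pdf(data: bytes) -> str:
    """Write raw PDF bytes to a temporary .pdf file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        return tmp.name


# Hypothetical usage inside the Streamlit script (not executed here):
#   uploaded_file = st.file_uploader("Upload your PDF")
#   if uploaded_file is not None:
#       tmp_path = save_bytes_to_temp_pdf(uploaded_file.getvalue())
#       try:
#           pages = PyPDFLoader(tmp_path).load_and_split()
#       finally:
#           os.remove(tmp_path)
```

The temporary file exists only so that the loader receives a real path on disk; it is removed once the pages have been loaded.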
