How can I use `langchain.document_loaders.PyPDFLoader` for PDF documents uploaded through Streamlit?

Question

I am trying to build a web app using Streamlit that reads documents (mainly PDFs) and loads their contents with `langchain.document_loaders.PyPDFLoader`, but I end up with the following error:

TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

followed by:

File "/opt/homebrew/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 133, in <module>
    main()
File "/Users/shuhulhandoo/MetaGeeks/PDF-URL_QA/app.py", line 75, in main
    loader = PyPDFLoader(pdf)
             ^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 92, in __init__
    super().__init__(file_path)
File "/opt/homebrew/lib/python3.11/site-packages/langchain/document_loaders/pdf.py", line 42, in __init__
    if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen genericpath>", line 30, in isfile

In my code, I upload the document in Streamlit using:

import streamlit as st
from langchain.document_loaders import PyPDFLoader

uploaded_file = st.file_uploader("Upload PDF", type="pdf")
if uploaded_file is not None:
    loader = PyPDFLoader(uploaded_file)

I want to use `PyPDFLoader` because I need the source information of the documents, such as page numbers, to be preserved.

I also tried extracting the text of the PDF page by page and concatenating it, as follows:

from PyPDF2 import PdfReader
import streamlit as st

uploaded_file = st.file_uploader("Upload PDF", type="pdf")

if uploaded_file is not None:
    texts = ""
    reader = PdfReader(uploaded_file)
    for page in reader.pages:
        texts += page.extract_text()

But in this case I lose the page numbers, which I need.
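
One way to keep the page number with the `PdfReader` approach is to store each page's text alongside its index instead of concatenating everything into one string. This is only a minimal sketch: the function name `pages_with_numbers` and the record layout are made up for illustration, and `reader` is assumed to be a `PdfReader`-like object whose `pages` each offer `extract_text()`.

```python
def pages_with_numbers(reader):
    """Return one record per page, keeping the 1-based page number."""
    records = []
    for number, page in enumerate(reader.pages, start=1):
        # extract_text() can yield nothing for pages without a text layer
        records.append({"page": number, "text": page.extract_text() or ""})
    return records
```

With the snippet above it would be called as `pages_with_numbers(PdfReader(uploaded_file))`.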

score 1 · Answer 1 · answered Jul 27 '23 at 09:18

`PyPDFLoader` takes a `file_path`, which must be a string path. That means you cannot pass the uploaded file object directly.

What you can do is save the file to a temporary location, pass that path to the PDF loader, and clean up afterwards.

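import os
from langchain.document_loaders import PyPDFLoader

# save the file temporarily
# (`file` is the uploaded file object; write its contents to tmp_location
# using whatever your framework provides, e.g. something like file.save(tmp_location))
tmp_location = os.path.join('/tmp', file.filename)

loader = PyPDFLoader(tmp_location)
pages = loader.load_and_split()

# do whatever you need here

# clean up the temporary file
if os.path.exists(tmp_location):
    os.remove(tmp_location)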
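
For Streamlit specifically, the save, load, and clean-up steps might be sketched as follows. It assumes the uploaded object exposes `getvalue()` (Streamlit documents its `UploadedFile` as a subclass of `BytesIO`); the helper names `temporary_pdf_path` and `load_pages` are hypothetical and not part of any library.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_pdf_path(uploaded_file):
    """Write an in-memory upload to a temporary .pdf file and yield its path.

    `uploaded_file` is assumed to expose getvalue(), as io.BytesIO does.
    The temporary file is deleted when the with-block exits.
    """
    # delete=False so the file can be reopened by path after it is closed
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(uploaded_file.getvalue())
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()
        if os.path.exists(tmp.name):
            os.remove(tmp.name)


def load_pages(uploaded_file):
    """Load an uploaded PDF into langchain Documents, one per page."""
    # Imported lazily so the helper above works without langchain installed.
    from langchain.document_loaders import PyPDFLoader

    with temporary_pdf_path(uploaded_file) as path:
        # load() reads everything up front, so the file can be removed afterwards
        return PyPDFLoader(path).load()
```

In the app this would be used as `pages = load_pages(uploaded_file)` after checking that `uploaded_file is not None`. Each returned document should carry the page index in `metadata["page"]`, while `metadata["source"]` will point at the temporary path, so you may want to overwrite it with `uploaded_file.name`.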

I hope this helps.

answered Jul 27 '23 at 09:18

Wai Lin Kyaw

Thanks for the reply! But the file that I uploaded comes from Streamlit via `streamlit.file_uploader`, and there is no path associated with what `streamlit.file_uploader` returns. How can I use `PyMuPDFLoader` for this? – Shuhul Handoo Jul 27 '23 at 10:20
You can use the `tmp_location` file path to save the Streamlit file. Just check the documentation for how to save the file. For example, it will be something like `file.save(tmp_location)` – Wai Lin Kyaw Aug 01 '23 at 01:24
