How to load pdf files from Azure Blob Storage with LangChain PyPDFLoader

Question

I am currently trying to use LangChain to chat with PDF documents. I have a number of PDF files stored in Azure Blob Storage, and I am trying to use LangChain's PyPDFLoader to load them into an Azure ML notebook, but I have not been able to get it working. If the PDF is stored locally there is no problem, but to scale up I have to connect to the blob store. I have not found any documentation on this on the LangChain or Azure websites. Has anyone run into a similar problem?

Thank you

Below is an example of the code I am trying:

from azureml.fsspec import AzureMachineLearningFileSystem
fs = AzureMachineLearningFileSystem("<path to datastore>")

from langchain.document_loaders import PyPDFLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = PyPDFLoader(fd)
    data = loader.load()

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

Another example tried:

from langchain.document_loaders import UnstructuredFileLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = UnstructuredFileLoader(fd)
documents = loader.load() 

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

You can download the blob with Python code (there are many examples online), then build the logic to process each downloaded file through your loader (UnstructuredFileLoader). — ZKS, Aug 31 '23 at 18:15

b1n3t · Answer 1 · 2023-08-31T18:52:59.710

The error message indicates that LangChain loaders expect a file path (a str, bytes, or os.PathLike object). AzureMachineLearningFileSystem.open() returns a file-like object instead, so the two are not directly compatible.

You could try reading the file into a BytesIO buffer and passing that stream to the loader.

import io
with fs.open('*/.../file.pdf', 'rb') as fd:
    content = io.BytesIO(fd.read())

loader = PyPDFLoader(content)
data = loader.load()

Please let me know if this solution works for you. If not, we can investigate the error further and try to solve it.

Edit: After reading the response from the question asker, I have come to realize that LangChain loaders expect a path rather than an in-memory binary stream.
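As a rough illustration of why both attempts fail, here is a small standard-library sketch. It assumes nothing about LangChain internals; it only shows that the built-in open() and os.stat() reject a BytesIO stream with a TypeError, which is the exception type reported in this thread. The helper name path_error is made up for this example.

```python
import io
import os

# An in-memory stream standing in for the blob content (placeholder bytes).
stream = io.BytesIO(b"%PDF-1.4 placeholder bytes")


def path_error(func, arg):
    """Call func(arg) and return the TypeError it raises, or None."""
    try:
        func(arg)
    except TypeError as exc:
        return exc
    return None


# Both functions want a path (str, bytes, os.PathLike), not a stream.
open_error = path_error(open, stream)
stat_error = path_error(os.stat, stream)

print(type(open_error).__name__, open_error)
print(type(stat_error).__name__, stat_error)
```

Any code that treats its argument as a path will hit the same wall, whatever kind of file-like object is passed in.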

So this time I propose temporarily saving the content from the Azure blob to local disk, passing that path to the LangChain loader, and deleting the temporary file after processing.

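import os
import tempfile

from langchain.document_loaders import PyPDFLoader

# Copy the blob to a local temporary file so the loader gets a real path.
with fs.open('*/.../file.pdf', 'rb') as fd:
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(fd.read())
        tmp_path = tmp.name

# The temporary file is closed here, so the loader can reopen it by path.
try:
    loader = PyPDFLoader(tmp_path)
    data = loader.load()
finally:
    # Remove the temporary file whether or not loading succeeded.
    os.remove(tmp_path)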
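If you need this for many files, the temporary-file step can be wrapped in a small context manager. The following is only a sketch: as_local_path is a hypothetical helper name, not part of LangChain or azureml, and the demo uses an in-memory BytesIO as a stand-in for the object returned by fs.open() so it runs without Azure access.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_path(fileobj, suffix=".pdf"):
    """Copy a binary file-like object to a temp file and yield its path.

    Hypothetical helper: the temp file is deleted when the block exits.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        with tmp:
            # Stream in chunks instead of reading the whole blob into memory.
            shutil.copyfileobj(fileobj, tmp)
        # The file is closed now, so other code can reopen it by path.
        yield tmp.name
    finally:
        if os.path.exists(tmp.name):
            os.remove(tmp.name)


# Demo with a stand-in stream (placeholder bytes, not a real PDF).
fake_blob = io.BytesIO(b"%PDF-1.4 placeholder bytes")
with as_local_path(fake_blob) as path:
    existed_inside = os.path.isfile(path)
    with open(path, "rb") as f:
        copied = f.read()
exists_after = os.path.exists(path)

# Intended use with the objects from the question (untested against Azure):
# with fs.open('*/.../file.pdf', 'rb') as fd, as_local_path(fd) as path:
#     data = PyPDFLoader(path).load()
```

Note that load() has to be called inside the with block, because the temporary file is removed as soon as the block exits.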

Thanks for the response. This also didn't work. This time, the error is: TypeError: stat: path should be string, bytes, os.PathLike or integer, not BytesIO — stackword_0, Aug 31 '23 at 18:11
@stackword_0 Can you please read my edited answer and tell me if it works? Sorry, I am not able to set up Azure Blob Storage right now, so I will need you to report back with the results. — b1n3t, Aug 31 '23 at 18:52
