I am currently trying to implement LangChain functionality to talk with PDF documents. I have a bunch of PDF files stored in Azure Blob Storage, and I am trying to use LangChain's PyPDFLoader to load them in an Azure ML notebook. However, I have not been able to get it working. If the PDF is stored locally there is no problem, but to scale up I have to connect to the blob store. I have not found any documentation on this on the LangChain or Azure websites. I am wondering if any of you has run into a similar problem.
Thank you
Below is an example of the code I am trying:
from azureml.fsspec import AzureMachineLearningFileSystem
from langchain.document_loaders import PyPDFLoader

fs = AzureMachineLearningFileSystem("<path to datastore>")

with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = PyPDFLoader(fd)  # passing the open file object raises the TypeError below
    data = loader.load()
Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject
Another example I tried:
from langchain.document_loaders import UnstructuredFileLoader

with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = UnstructuredFileLoader(fd)
    documents = loader.load()
Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject
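Both errors suggest that the loaders are being handed an open file object where they expect a file path. One possible workaround, sketched below and not tested against Azure ML, is to copy the stream to a temporary local file and pass that path to the loader. The helper names `as_local_path` and `load_pdf_from_stream` are hypothetical and only for illustration. The sketch assumes the object returned by `fs.open(...)` behaves like a normal binary file object.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_path(stream, suffix=".pdf"):
    """Copy a binary file-like object to a temp file and yield its path.

    Hypothetical helper: the temp file is removed when the block exits.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            # Stream the bytes in chunks so large PDFs are not held in memory.
            shutil.copyfileobj(stream, tmp)
        yield tmp.name
    finally:
        os.remove(tmp.name)


def load_pdf_from_stream(stream):
    """Load a PDF from a binary stream by way of a temporary local path."""
    # Imported here so as_local_path can be used without langchain installed.
    from langchain.document_loaders import PyPDFLoader

    with as_local_path(stream) as path:
        # load() reads all pages before the temp file is deleted.
        return PyPDFLoader(path).load()


# Intended usage with the filesystem object from the question (untested):
# with fs.open('*/.../file.pdf', 'rb') as fd:
#     data = load_pdf_from_stream(fd)
```

This copies each file to local disk before parsing, which may be slow for many large PDFs. The appeal is that it leaves the loader unchanged and only adapts what is passed to it.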