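import os
from langchain.document_loaders import UnstructuredPDFLoader

# pdf_folder_path is the directory that holds the PDF files
files = os.listdir(pdf_folder_path)
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in files]
# Load every file and flatten the results into a single list of documents
docs = [doc for loader in loaders for doc in loader.load()]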
The docs loaded this way are missing hyperlinks. I only get links whose URLs are written out explicitly in the text. I also looked at a couple of the other loaders (https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf), such as PyMuPDFLoader and PDFPlumberLoader:
from langchain.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")
data = loader.load()
When I then go ahead and create the index and the QA chain:
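from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA

index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=10),
).from_loaders(loaders)
chain = RetrievalQA.from_chain_type(llm=foundation_model,  # foundation_model is my LLM instance
                                    chain_type="stuff",
                                    retriever=index.vectorstore.as_retriever(),
                                    input_key="question")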
The answers I retrieve do not contain hyperlinks. For example, an answer says something like "look at FAQ", where the original hyperlink on "FAQ" comes through as plain text only. The reason is that the loaders in LangChain drop the link targets. I see that the related issue has been closed: https://github.com/langchain-ai/langchain/issues/8157
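
A minimal sketch of one possible workaround, not something the LangChain loaders do themselves: read the link annotations with PyMuPDF directly and append them to each page's text before indexing, so the URLs end up in the chunks that get embedded. The helper names `append_links` and `load_pdf_with_links` are hypothetical, and the sketch assumes PyMuPDF (imported as `fitz`) is installed and that `langchain.schema.Document` is available in the installed LangChain version.

```python
from typing import Iterable, List, Tuple


def append_links(text: str, links: Iterable[Tuple[str, str]]) -> str:
    """Append (anchor text, URL) pairs to the page text so the URLs are kept."""
    seen = set()
    lines: List[str] = []
    for anchor, uri in links:
        anchor = " ".join(anchor.split())  # collapse newlines and repeated spaces
        key = (anchor, uri)
        if not uri or key in seen:
            continue  # skip empty targets and exact duplicates
        seen.add(key)
        lines.append(f"{anchor}: {uri}" if anchor else uri)
    if not lines:
        return text
    return text.rstrip() + "\n\nLinks on this page:\n" + "\n".join(lines)


def load_pdf_with_links(path: str):
    """Hypothetical loader: one Document per page, with link targets appended."""
    # Imported lazily so append_links can be used without these packages.
    import fitz  # PyMuPDF
    from langchain.schema import Document

    docs = []
    with fitz.open(path) as pdf:
        for page in pdf:
            links = [
                # link["from"] is the clickable rectangle; read the text inside it
                (page.get_textbox(link["from"]), link["uri"])
                for link in page.get_links()
                if link.get("uri")  # keep only links that point to an external URI
            ]
            content = append_links(page.get_text(), links)
            docs.append(Document(page_content=content,
                                 metadata={"source": path, "page": page.number}))
    return docs
```

The resulting documents could then be indexed with `VectorstoreIndexCreator(...).from_documents(docs)` instead of `from_loaders(loaders)`. Whether the model actually repeats the URLs in its answers still depends on the chunking and the prompt, so this is only a starting point.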