LangChain document loaders are missing the hyperlinks in a PDF file; I have tried a few loaders and all have the same problem

Question

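import os
from langchain.document_loaders import UnstructuredPDFLoader

# pdf_folder_path is the directory that holds the PDF files
files = os.listdir(pdf_folder_path)
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in files]
# Load every file; each loader returns a list of Document objects
docs = [doc for loader in loaders for doc in loader.load()]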

The resulting docs are missing the hyperlinks; I only get links whose URLs are written out explicitly in the text. I also looked at a couple of the other loaders listed at https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf (PyMuPDFLoader, PDFPlumberLoader, etc.), for example:

from langchain.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")
data = loader.load()

When I go ahead and create the index and the chain:

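from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter

# foundation_model is the LLM instance created elsewhere
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)).from_loaders(loaders)
chain = RetrievalQA.from_chain_type(llm=foundation_model,
                                    chain_type="stuff",
                                    retriever=index.vectorstore.as_retriever(),
                                    input_key="question")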

The answers I retrieve do not contain the hyperlinks. For example, where the PDF says "look at FAQ" with FAQ as a hyperlink, the original hyperlink comes back as plain text. The reason is that the loaders in LangChain drop the link targets. I see that the related issue was closed: https://github.com/langchain-ai/langchain/issues/8157

Please find a sample file here: https://drive.google.com/file/d/1D8DtV3J_89LvgYOSs9LKRykgQKmqlD1K/view?usp=sharing — nithin, Aug 29 '23 at 16:58
One way is to extract the links yourself and build the loader documents to suit your use case. `pymupdf` provides a way to extract the links from a page and, with some adjustment, to get the linked text as well. Sample code will look like this:
import fitz
pdf_file = "data_source/test_url.pdf"
doc = fitz.open(pdf_file)
for page in doc:
    links = page.get_links()
    for e in links:
        txt = page.get_textbox(e['from'] + (+1, 0, -1, 0))
        print(txt, e['uri'])
— simpleApp, Aug 30 '23 at 02:03
I am doing more or less the same thing, but when chain.run() executes I expect the model to output the links associated with the question asked, and that does not work. — nithin, Aug 30 '23 at 04:07
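
A rough sketch of the pymupdf idea from the comments, taken one step further: instead of printing the links, write each URL into the page text next to its anchor, so that whatever gets embedded and retrieved actually contains the URL. The function names `inline_links` and `pdf_pages_with_links` are made up for illustration and are not part of LangChain or PyMuPDF. The sketch assumes that the anchor text returned by `get_textbox` appears verbatim in `page.get_text()`; when it does not, the URL is appended at the end of the page instead. Only `pdf_pages_with_links` needs PyMuPDF installed.

```python
from typing import Iterable, List, Tuple


def inline_links(page_text: str, links: Iterable[Tuple[str, str]]) -> str:
    """Rewrite the first occurrence of each anchor text as 'anchor (uri)'.

    URLs whose anchor text cannot be found in the page text are appended
    at the end, so that no URL is silently dropped.
    """
    leftovers: List[str] = []
    for anchor, uri in links:
        anchor = anchor.strip()
        if anchor and anchor in page_text:
            page_text = page_text.replace(anchor, f"{anchor} ({uri})", 1)
        else:
            leftovers.append(uri)
    if leftovers:
        page_text += "\nLinks: " + ", ".join(leftovers)
    return page_text


def pdf_pages_with_links(pdf_path: str) -> List[str]:
    """Return one string per page with link targets inlined (needs PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so inline_links works without it

    pages: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pairs = []
            for link in page.get_links():
                uri = link.get("uri")
                if uri:  # internal jumps within the PDF carry no "uri" key
                    pairs.append((page.get_textbox(link["from"]), uri))
            pages.append(inline_links(page.get_text(), pairs))
    return pages
```

For example, `inline_links("See the FAQ for details.", [("FAQ", "https://example.com/faq")])` gives `"See the FAQ (https://example.com/faq) for details."`. The resulting page strings could then be wrapped in LangChain Document objects and indexed in place of the loader output. Whether the model repeats the URL in its answer still depends on the prompt and on the text splitter keeping the anchor and its URL in the same chunk, so treat this as a starting point rather than a fix.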
