Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.
Questions tagged [pdf-extraction]
148 questions
1
vote
1 answer
How to extract the anchor text/words of every hyperlink from a PDF using Python?
I am trying to extract the hyperlinks on each page of a PDF, together with their anchor text, using the PyMuPDF library. I can extract the hyperlinks with their page numbers, but I can't extract the anchor text/words for each hyperlink.
Can anyone help…

gagan lohar
- 11
- 2
1
vote
0 answers
Extract specific pages from a single pdf file and save as separate individual files
I'm very new to Python. I started a week ago and am trying to learn some cool things to do with PDFs, but I really don't know how to go about this.
I have the attached PDF file, from which I would like to extract all the pages between the keywords "PAGE…

Learner
- 11
- 2
1
vote
1 answer
Is there a PDF parsing library that can extract text from given coordinates?
Good morning, fellas. I have been assigned a task wherein I am supposed to extract text from a PDF file (a bank invoice), as per the given specification of fields and sections. This specification is given in a YAML file. The fields are expressed as…

Jim
- 19
- 2
1
vote
2 answers
Get all PDF file names in the same folder and save them in Excel according to PDF file name
I have several PDF files in the same folder. How can I get all the PDF file names and save them as an Excel file according to PDF file name?
This is what I have tried:
def get_files(pdf_path):
    import os
    os.chdir(pdf_path)
    files = os.listdir()
    files = [x for x…
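For the file-listing part of this question, a minimal sketch using only the standard library might look like the following. It assumes that a CSV file (which Excel can open) is an acceptable stand-in for a native Excel workbook, and that only files directly inside the folder matter. `list_pdf_names` and `write_names_csv` are hypothetical helper names, not part of the asker's code.

```python
import csv
from pathlib import Path


def list_pdf_names(folder):
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        # Compare the extension case-insensitively so "A.PDF" is also matched.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def write_names_csv(names, csv_path):
    """Write one file name per row under a single `file_name` header."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name"])
        writer.writerows([name] for name in names)
```

Unlike the `os.chdir` approach in the excerpt, this sketch leaves the process's working directory unchanged, which tends to be safer when the function is called from a larger script.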

kkk
- 95
- 1
- 2
- 11
1
vote
1 answer
Document Understanding is extracting data from all the pages of a PDF in UiPath
I am using Document Understanding in UiPath to extract data from multiple PDFs. Each PDF file contains multiple copies of the same page, which I cannot remove. The trouble is:
1.) The Regex Extractor is extracting data from all the pages of the pdf…

spectre
- 717
- 7
- 21
1
vote
0 answers
How to extract tables from PDFs while pulling in non-table text section identifiers
I'm working through extracting tables using pdfplumber in Python from a PDF that has mostly-consistent structure between pages.
My goal is to extract each of the 2 tables under each section header (white font highlighted blue) on each page. See…

WinstonDoodle
- 23
- 3
1
vote
2 answers
How to resolve the Java ExceptionInInitializerError?
I am creating an app that extracts the text from a PDF. For this I am using the PDFBox library. But when I import a PDF from the file manager, the app stops and throws an ExceptionInInitializerError at the line where PDFTextStripper is initialized. How…

Sarthak Kumar
- 81
- 8
1
vote
1 answer
How to extract any image with python PDF extraction?
I created a PDF extraction program using Tkinter, PyPDF2, and PIL by following a tutorial.
This is the image extraction code:
def extract_images(page):
    images = []
    if '/XObject' in page['/Resources']:
        xObject =…

UIB
- 11
- 1
1
vote
1 answer
How to convert DeviceRGB to System.Drawing.Color?
I am trying to get the fill color of paths with iText 7 using
fillclr= pathrenderinfo.getfillcolor.getcolorvalue()
but it returns the value in DeviceRGB format, and I need it as a System.Drawing.Color. Is there any way to convert DeviceRGB…

Shivanand Pandey
- 27
- 4
1
vote
1 answer
How to find table grid lines in PDF files?
To more accurately extract table-like data embedded within table cells, I would like to be able to identify table cell boundaries in PDFs like this:
I have tried extracting such tables using Camelot, pdfplumber, and PyMuPDF, with varying degrees of…

Mark Turner
- 81
- 2
- 5
1
vote
1 answer
How can I improve text recognition accuracy with jTessBoxEditor?
I have been trying to extract data from scanned PDF documents.
I converted the PDF file into a JPEG file (I have attached the image link below), cropped the words and numbers in different fonts, merged them into a TIFF file, and trained the fonts…

Jeff
- 11
- 1
1
vote
0 answers
Overwriting the ToUnicode Map stream in a PDF
In this question, mkl provides a fantastic answer to pnj's predicament. We are unfortunately facing a very similar issue (with a different font called Lohit - Devanagari, but still a Devanagari font). The second comment outlines the non-OCR solution…

wireman
- 144
- 4
1
vote
1 answer
Extract data from pdf files with R
I am trying to extract data (tables) from pdf files and store them as data frames.
library(pdftools)
library(tabulizerjars)
library(tabulizer)
library(tidyverse)
f <- file.path("D:/Araratbank/Statement USD-pages-1.pdf")
#using pdf tools…

Hayk
- 27
- 1
- 7
1
vote
2 answers
PDF data extraction from a scanned PDF using Python
I am extracting data from a scanned PDF with Tesseract OCR. I am able to extract data, but the accuracy is not good: in many places it shows wrong data. Can I get the data with 100% accuracy using Python?
First I convert the PDF to JPG format, then I…

Sumesh Kumar
- 11
- 1
- 2
1
vote
3 answers
How do I extract tables from a historical PDF?
I need to extract data from similarly formatted tables from this file. There are some OCR errors but I have an automated method to correct them.
I have tried:
ABBYY FineReader table detection.
Tabula table extraction
Camelot table…

FBB
- 326
- 1
- 12