Questions tagged [pdf-extraction]

Extracting text and other data from a PDF document, regardless of the libraries used to achieve this.

148 questions
1
vote
1 answer

How to extract the anchor text of every hyperlink from a PDF using Python?

I am trying to extract the hyperlinks on each page, together with their anchor text, from a PDF using the PyMuPDF library. I can extract the hyperlinks with their page numbers, but I have not been able to extract the anchor text for each hyperlink. Can anyone help…
1
vote
0 answers

Extract specific pages from a single pdf file and save as separate individual files

I'm very new to Python. I just started a week ago and am trying to learn some cool stuff around PDF, but really don't know how to go about this. I have the attached pdf file that I would like to extract all the pages between the keywords "PAGE…
1
vote
1 answer

Is there a PDF parsing library that can extract text from given coordinates?

Good morning, fellas. I have been assigned a task wherein I am supposed to extract text from a PDF file (a bank invoice), as per the given specification of fields and sections. This specification is given in a YAML file. The fields are expressed as…
Jim
  • 19
  • 2
1
vote
2 answers

Get all PDF file names in the same folder and save them to Excel according to PDF file name

I have PDF files in the same folder. How can I get all the PDF file names and save them as an Excel file according to PDF file name? This is what I have tried: def get_files(pdf_path): import os os.chdir(pdf_path) files = os.listdir() files = [x for x…
kkk
  • 95
  • 1
  • 2
  • 11
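
For the folder-listing question above, a minimal sketch of one possible approach, using only the standard library. It assumes that a CSV file (which Excel opens directly) is an acceptable stand-in for a native workbook; the function names `list_pdf_names` and `write_names_csv` and the column header `pdf_file_name` are hypothetical and do not come from the asker's code.

```python
import csv
from pathlib import Path


def list_pdf_names(folder):
    """Return the sorted names of the PDF files directly inside `folder`."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def write_names_csv(names, out_path):
    """Write one file name per row under a single header column."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["pdf_file_name"])  # hypothetical header
        writer.writerows([name] for name in names)
```

Reading the folder with `Path.iterdir()` avoids the `os.chdir` call, so the working directory of the rest of the program is left unchanged.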
1
vote
1 answer

Document Understanding is extracting data from all the pages of pdf in UiPath

I am using Document Understanding in UiPath to extract data from multiple PDFs. Each PDF file contains multiple copies of the same page, which I cannot remove. The trouble is: 1.) The Regex Extractor is extracting data from all the pages of the pdf…
spectre
  • 717
  • 7
  • 21
1
vote
0 answers

How to extract tables from PDFs while pulling in non-table text section identifiers

I'm working through extracting tables using pdfplumber in Python from a PDF that has mostly-consistent structure between pages. My goal is to extract each of the 2 tables under each section header (white font highlighted blue) on each page. See…
1
vote
2 answers

How to resolve the Java ExceptionInInitializerError?

I am creating an app that extracts the text from a PDF, using the PDFBox library. But when I import a PDF from the file manager, the app stops and throws an ExceptionInInitializerError at the line where PDFTextStripper is initialized. How…
1
vote
1 answer

How to extract any image with Python PDF extraction?

I created a PDF extraction program using Tkinter, PyPDF2, and PIL by following a tutorial. This is the image extraction code: def extract_images(page): images = [] if '/XObject' in page['/Resources']: xObject =…
UIB
  • 11
  • 1
1
vote
1 answer

How to convert DeviceRGB to System.Drawing.Color?

I am trying to get the fill color of paths using iText 7 with fillclr= pathrenderinfo.getfillcolor.getcolorvalue(), but it returns the value as DeviceRGB and I need it as a System.Drawing.Color. Is there any way to convert DeviceRGB…
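
The arithmetic behind such a conversion can be sketched in Python. This is only an illustration, assuming the color value is a sequence of three floats in the range 0.0 to 1.0 (as PDF DeviceRGB components are); the function name `device_rgb_to_bytes` is hypothetical. In C#, the three resulting integers would then be passed to something like `Color.FromArgb(r, g, b)`.

```python
def device_rgb_to_bytes(components):
    """Scale three DeviceRGB floats (0.0 to 1.0) to integers in 0..255."""
    if len(components) != 3:
        raise ValueError("expected exactly three components (R, G, B)")
    result = []
    for c in components:
        clamped = min(1.0, max(0.0, c))          # guard against out-of-range input
        result.append(int(clamped * 255 + 0.5))  # round half up to nearest integer
    return tuple(result)
```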
1
vote
1 answer

How to find table grid lines in PDF files?

To more accurately extract table-like data embedded within table cells, I would like to be able to identify table cell boundaries in PDFs like this: I have tried extracting such tables using Camelot, pdfplumber, and PyMuPDF, with varying degrees of…
1
vote
1 answer

How can I improve text recognition accuracy with jTessBoxEditor?

I have been trying to extract data from scanned PDF documents. I converted the PDF file into a JPEG file (I have attached the image link below), cropped the words and numbers with different fonts, merged them into a TIFF file and trained the fonts…
Jeff
  • 11
  • 1
1
vote
0 answers

Overwriting the ToUnicode Map stream in a PDF

In this question, mkl provides a fantastic answer to pnj's predicament. We are unfortunately facing a very similar issue (with a different font called Lohit - Devanagari, but still a Devanagari font). The second comment outlines the non-OCR solution…
wireman
  • 144
  • 4
1
vote
1 answer

Extract data from pdf files with R

I am trying to extract data (tables) from pdf files and store them as data frames. library(pdftools) library(tabulizerjars) library(tabulizer) library(tidyverse) f <- file.path("D:/Araratbank/Statement USD-pages-1.pdf") #using pdf tools…
Hayk
  • 27
  • 1
  • 7
1
vote
2 answers

PDF data extraction from a scanned PDF using Python

I am extracting data from a scanned PDF with Tesseract OCR. The extraction works, but the accuracy is not good: in many places it shows wrong data. Can I get the data with 100% accuracy using Python? First I convert the PDF to JPG format, then I…
1
vote
3 answers

How do I extract tables from a historical PDF?

I need to extract data from similarly formatted tables from this file. There are some OCR errors but I have an automated method to correct them. I have tried: ABBYY FineReader table detection, Tabula table extraction, Camelot table…
FBB
  • 326
  • 1
  • 12