Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because of the complexity of the PDF format (cf. the specification ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from every document) and can be exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python-related options:

You may extract tables directly using camelot ("PDF Table Extraction for Humans").
You may process the PDF directly using tabula.
You may convert the PDF to text using pdftotext, then parse the text with Python.
You may use an external tool to convert your PDF file to Excel or CSV, then open the Excel/CSV file with the appropriate Python module.
You may also convert the PDF to an image file, then use OCR software (some tools reconstruct tables automatically from the picture) to get the data, for example pdf2image with pytesseract.
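
As a rough illustration of the pdftotext route above, here is a minimal sketch. It assumes the Poppler `pdftotext` command-line tool is installed and on the PATH, and that table columns in its `-layout` output are separated by at least two spaces. The function names `pdf_to_text` and `split_columns` and the `min_gap` parameter are hypothetical, chosen for this example only.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run `pdftotext -layout` and return the extracted text.

    Assumes Poppler's pdftotext is installed; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_columns(text, min_gap=2):
    """Split each non-empty line into cells on runs of >= min_gap whitespace.

    This is a heuristic: single spaces inside a cell are kept, wider gaps
    are treated as column boundaries.
    """
    gap = re.compile(r"\s{%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows


# Example on a hand-written string that mimics layout-preserving output.
sample = (
    "Name        Goals  Assists\n"
    "Anna Smith      3        1\n"
    "\n"
    "Bob Lee         0        2\n"
)
rows = split_columns(sample)
```

With a real file, `split_columns(pdf_to_text("report.pdf"))` would be the equivalent call, where `report.pdf` is a placeholder path. The whitespace heuristic tends to break on tables with empty cells or wrapped text, which is where dedicated tools such as camelot or tabula are usually a better fit.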

Related Questions:

177 questions

1 answer

How to avoid duplication in Python PDF parsing code for mismatching table structures?

I have over 100 PDFs that are match reports from which I want to scrape data in order to store it in dataframes so I can work with it afterwards. Problem is: Those PDFs don't always have the same structure and the reading from pdfplumber gives me…

asked May 25 '23 at 21:01

Pablo Martín Calvo

0 answers

Extracting data from pdf in specific format

I want to extract the data in the form of a hash. The text is like this. Sometimes there is a single Arrest type and multiple charge and charge description and sometimes one Arrest type and one charge and charge description. Sometimes multiple arrest…

ruby pdf-parsing

asked Mar 31 '23 at 09:38

snoozy

0 answers

Clustering a set of letters into a table by position

I have a set of letters that are positioned on the plane (for each letter I know the coordinates of its corner points, and strings can be treated as parallelograms). I know that the strings form a table, but I know neither how many rows or columns…

pdf cluster-analysis pdf-parsing

asked Mar 20 '23 at 21:15

JohnDiGriz

0 answers

How is Word Able to detect PDF structure so well where others fail? Is there a Library that can achieve this?

I've been interested in parsing PDFs for some time now, with varying degrees of success. Often, however, useful data in PDFs is contained in the text, i.e. outside tables etc. If you are to get data out of the sentences, however, it is vital that the…

pdf cpu-word text-parsing pdf-parsing

asked Mar 10 '23 at 09:10

Nick

0 answers

Error from tabula-java: Error: Error: Header doesn't contain versioninfo

I have a script that parses PDF files. It works perfectly on my WSL, but when I deploy it on CentOS 7, I get this error. I'm using tabula-py (Python version: 3.6, Java version: 11). When I searched for the error, I found nothing. Can someone…

python java tabula pdf-parsing tabula-py

asked Mar 10 '23 at 04:45

mayk.dyasper

1 answer

How to calculate coordinates of the PDF text (knowing only the list of operations)

I'm processing a PDF document in a program. The only part of the document I have access to is a list of PDF operations (with their arguments), and a list of horizontal displacements for the glyphs and fonts that appear in the document. Is it…

pdf pdf-parsing

asked Mar 10 '23 at 00:34

JohnDiGriz

0 answers

How to identify whether the text is boxed in PDF using PDFBOX?

I am trying to check whether the text is boxed using Apache PDFBox. For a few PDFs the code below won't work. public class PDFBoxReader extends PDFGraphicsStreamEngine { private static ArrayList recList = new ArrayList(); …

pdfbox pdf-parsing

asked Feb 09 '23 at 03:50

Arunkumar

1 answer

How do I reference the PDF IFilter (dll) interface built into Windows to extract text and properties of a pdf document via Classic ASP

I want to extract text and properties (author, title, etc.) of PDF file. I need to extract and parse Text from a pdf file in a classic ASP environment. I read another post about using the PDF iFilter driver installed with Adobe Acrobat 9 which can…

asp-classic ifilter pdf-parsing

asked Apr 15 '09 at 17:39

Sanjeev

1 answer

how to recognize a graph in pdf using python?

I'm new to PDF parsing. I want to recognize a graph in a PDF file so I can skip it and not extract this type of text. All I know about the PDF is that it is generated from Word (not scanned). Input: a PDF with a graph such as this one. Output should…

pdf text-parsing pdf-parsing pdfplumber

asked Nov 17 '22 at 12:22

learningtocode

1 answer

How to extract text based on parts from a PDF file in JSON format?

From this file https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/Dart.pdf I would like to get this kind of result: { "file": { "title": "Dart Programming Language Specification", "1 Scope": { …

python ocr tesseract pdfminer pdf-parsing

asked Oct 31 '22 at 22:38

Tim

1 answer

Php Pdf Parser read content showing as a two lines. need to fix it

I used pdfparser to read PDF content, but one address line is showing as two lines. In that case it appears as two new lines. I want to get the full address as one line. The PDF files are dynamic. According to the address length it is showing as a…

php pdf pdf-parsing pdfparser

asked Oct 27 '22 at 06:30

Chaminda Chanaka

1 answer

how to upload local pdf files to google collab notebook?

I want to upload a local PDF into Google Colab and parse it with Python. How can I load the file so I can use with open?

python file google-colaboratory pdf-parsing

asked Oct 26 '22 at 22:30

learningtocode

0 answers

Extract geometric objects (lines, circles,...) from a pdf using PDFMM

I have a PDF containing several geometric objects (mostly lines) in different sizes and colors. I want to extract them in the following form, e.g. for lines: (startx, starty) (endx, endy) width color. Optionally, a "z" position determining which object…

c++ pdf-parsing podofo

asked Sep 16 '22 at 15:04

Peter

1 answer

Apache PDFBox - vertical match between image and text position

I need help to achieve a mapping between text and image objects in a PDF document. As the first figure shows, my PDF documents have 3 images arranged randomly in the y-direction. To the left of them are texts. The texts extend along the height of…

java pdfbox pdf-parsing

asked May 17 '22 at 12:18

ralle

1 answer

(while reading XRef): Error: Invalid XRef stream header?

Hi, I am trying to read a PDF in Node.js. When I try to read this PDF, it starts showing this error: (while reading XRef): Error: Invalid XRef stream header Error: Error: Invalid XRef stream header at error (eval at …

javascript node.js pdf-parsing

asked May 16 '22 at 09:28

satyaarth chhabra
