Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag covers parsing these documents: extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because the PDF format is complex (cf. the specification, ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from every document) and prone to security vulnerabilities.
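Since parsers can fail on malformed input, a cheap pre-check before handing a file to a full parser sometimes helps narrow down errors. The following is only a minimal sketch, not a validator: the function names are hypothetical, and searching the first and last 1024 bytes is a lenient assumption rather than a requirement taken from the specification.

```python
import re
from typing import Optional

# A PDF file starts with a header such as "%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def sniff_pdf_version(data: bytes) -> Optional[str]:
    """Return the version from the PDF header, or None if no header is found.

    Assumption: only the first 1024 bytes are searched, to tolerate
    leading junk without scanning the whole file.
    """
    match = _HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def has_eof_marker(data: bytes) -> bool:
    """Check for the '%%EOF' marker near the end of the file."""
    return b"%%EOF" in data[-1024:]
```

A file that fails either check is likely truncated or not a PDF at all; a file that passes both can still be malformed in ways only a full parser will detect.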

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It can extract raw data, such as compressed images, from PDF documents.

Python-related options:

You may extract tables directly using camelot ("PDF Table Extraction for Humans").
You may process the PDF directly using tabula.
You may convert the PDF to text using pdftotext, then parse the text with Python.
You may use an external tool to convert the PDF file to Excel or CSV, then open the resulting file with a suitable Python module.
You may also convert the PDF to an image file, then use OCR software (which can reconstruct tables automatically from the picture) to get the data, for example pdf2image with pytesseract.
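The pdftotext route listed above can be sketched roughly as follows. This assumes the pdftotext command-line tool (from Poppler) is installed and on PATH; the function names and the "two or more spaces separate columns" heuristic are illustrative assumptions, not part of any library.

```python
import re
import subprocess
from typing import List


def pdf_to_text(pdf_path: str) -> str:
    """Run the pdftotext CLI and return the extracted text.

    '-layout' asks pdftotext to keep the physical layout of the page,
    and '-' as the output file sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_columns(text: str) -> List[List[str]]:
    """Split each non-empty line into cells on runs of 2+ whitespace characters.

    This is a heuristic: it breaks when a cell itself contains wide gaps
    or when neighbouring columns are separated by a single space.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows
```

Typical use would be `rows = split_columns(pdf_to_text("report.pdf"))`, where "report.pdf" is a placeholder path. For PDFs with ruled or irregular tables, a dedicated tool such as camelot or tabula is likely to do better than this whitespace heuristic.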

Related Questions:

177 questions

votes

6 answers

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error: pdf2txt.py 2.pdf Traceback (most recent call last): File "/usr/local/bin/pdf2txt.py", line 115, in if __name__ == '__main__':…

asked Oct 20 '16 at 15:28

Danil

votes

1 answer

Difference between iTextSharp 4.1.6 and 5.x versions

We are developing a Pdf parser to be used along with our system. The requirement is such that, we store all the information on any pdf documents and should be able to reproduce the document as such (with minimal changes from original document). We…

pdf licensing itext pdf-parsing

asked Jun 20 '14 at 11:59

Shankar

votes

0 answers

Regarding No Unicode mapping error while parsing pdf

I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files). Current parsing outcome: Tika silently returns text, which is missing a lot of needed data. Using PDFBox directly…

parsing unicode pdfbox apache-tika pdf-parsing

asked Aug 06 '20 at 04:17

exenza

votes

2 answers

PDFminer empty output

While processing a file with pdfminer (pdf2txt.py) I received empty output: dan@work:~/project$ pdf2txt.py docs/homericaeast.pdf dan@work:~/project$ Can anybody say what is wrong with this file and what I can do to get data from it? Here's…

python pdf pdfminer pdf-parsing

asked May 07 '17 at 14:10

Danil

votes

2 answers

Apache PDFBox Remove Spaces between characters

We are using PDFBox to extract text from PDFs. Some PDFs' text can't be extracted correctly. The following image shows a part from the PDF as image: After text extraction we get the following text: 3, 8 5 EU R 1 Netto 38,50 EUR…

pdfbox text-extraction pdf-parsing

asked Apr 10 '15 at 06:01

TobiasH

votes

1 answer

get text paragraph from pdf using itextsharp

Is there any logic to get paragraph text from a PDF file using iTextSharp? I know PDF only supports runs of text and it's hard to determine which runs of text are related to which paragraph, and I also know that there aren't any tags or other tags to…

c# asp.net itext pdf-parsing

asked Jun 14 '13 at 05:39

Bibek Gautam

votes

1 answer

haskell - parsing/reading content of .pdf-files

is there any possibility in haskell to just decrypt a .pdf file, read in the content and return a String? And, if there is one, could you give me a little example like e.g.: ... import necessaryPackage ... pdfParsing = ... ... Thanks in…

parsing pdf haskell ghc pdf-parsing

asked Mar 03 '13 at 14:32

jimmyt

votes

2 answers

Parsing PDF files in Hadoop Map Reduce

I have to parse PDF files that are in HDFS in a MapReduce program in Hadoop. So I get the PDF file from HDFS as input splits, and it has to be parsed and sent to the Mapper class. For implementing this InputFormat I had gone through this link.…

pdf hadoop mapreduce pdf-parsing

asked Feb 24 '12 at 08:41

WR10

votes

1 answer

Identify and extract specific sections of a PDF document

I have several exams in PDF format. I want to programmatically extract each question as a separate image/document. OCR is not ideal because it does not maintain code/equation formatting well. The end goal is to make flash cards with each card…

python pdf ocr image-recognition pdf-parsing

asked Nov 07 '17 at 01:54

aki

votes

2 answers

PDF Cross Reference Streams

I'm developing a PDF parser/writer, but I'm stuck at generating cross reference streams. My program reads this file and then removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it. This…

pdf pdf-generation pdf-parsing

asked Dec 29 '10 at 17:30

Van Coding

votes

2 answers

Extracting tables from a pdf

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck but I can't really get the data from the tables. This is what one of the tables looks like: As you can see, some columns are marked with an…

python python-2.7 ocr pdfminer pdf-parsing

asked Jan 13 '15 at 17:22

user

votes

1 answer

Python PDFMiner error: "No /Root object! - Is this really a PDF?"

I am getting this error "No /Root object! - Is this really a PDF?" using my MAC computer with Python 2.7 and PDFMiner version 20110515. The pdf files are not damaged because the same program with the same files works on my PC computer! Also I have…

python macos pdf document-root pdf-parsing

asked Jun 26 '13 at 22:42

Mahshid Zeinaly

votes

2 answers

Extract all text with string positions from a PDF

This may seem an old question, but I didn't find an exhaustive answer after spending half an hour searching all over SO. I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am…

java pdfbox pdf-parsing

asked Apr 02 '12 at 10:49

Andrea Sprega

votes

3 answers

AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run

I'm trying to run a simple parse line of code using Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…

python parsing apache-tika pdf-parsing tika-server

asked Nov 11 '19 at 14:46

dweir247

votes

1 answer

node.js How to use a url as pdf-path to work with pdf2json

I'm using node.js and pdf2json parser to parse a pdf file. Currently it is working with a local pdf file. But I'm trying to get a pdf-file through the URL/HTTP Module of node.js and I want to open this file to parse it. Is there any possibility to…

javascript node.js parsing pdf pdf-parsing

asked Jul 12 '17 at 10:27

Daniel Wahl
