Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because the PDF format is complex (cf. the specification, ISO 32000-1), parsers are often incomplete (they cannot extract all information from every document) and can be exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.
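To give a rough idea of what "extracting raw data" involves, here is a minimal Python sketch that scans a byte string for stream objects and decompresses those using the /FlateDecode filter. It is not how pdf-parser works internally. It assumes a direct integer /Length entry and ignores the cross-reference table, indirect lengths, encryption, and every other filter. The helper make_demo_pdf_fragment is hypothetical and exists only to produce test input.

```python
import re
import zlib

# Naive pattern: a dictionary immediately followed by the "stream" keyword.
# A real parser would locate objects through the cross-reference table.
STREAM_START_RE = re.compile(rb"<<(.*?)>>\s*stream\r?\n", re.DOTALL)
# Direct integer lengths only; "/Length 5 0 R" (indirect) is skipped.
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")


def extract_flate_streams(pdf_bytes):
    """Return the decompressed bodies of /FlateDecode streams found."""
    results = []
    pos = 0
    while True:
        match = STREAM_START_RE.search(pdf_bytes, pos)
        if match is None:
            break
        dictionary = match.group(1)
        lengths = LENGTH_RE.findall(dictionary)
        if not lengths:
            pos = match.end()  # no usable /Length; keep scanning
            continue
        length = int(lengths[-1])
        body = pdf_bytes[match.end():match.end() + length]
        pos = match.end() + length  # skip the binary body
        if b"/FlateDecode" in dictionary:
            try:
                results.append(zlib.decompress(body))
            except zlib.error:
                pass  # corrupt, encrypted, or multi-filter stream
    return results


def make_demo_pdf_fragment(payload):
    """Hypothetical helper: build one stream object for demonstration."""
    compressed = zlib.compress(payload)
    header = b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    return header + compressed + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    fragment = make_demo_pdf_fragment(b"hello pdf")
    print(extract_flate_streams(fragment))
```

For real documents, a maintained parser is the safer choice; the sketch only shows why compressed streams have to be located and decoded before their contents can be inspected.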

Python Related Options:

You can extract tables directly with camelot ("PDF Table Extraction for Humans").
You can process the PDF directly with tabula.
You can convert the PDF to text with pdftotext, then parse the text with Python.
You can use an external tool to convert the PDF to Excel or CSV, then open the resulting file with a suitable Python module.
You can also convert the PDF to an image file, then use recent OCR software (some of which reconstructs tables automatically from the picture) to get the data.
For the OCR route, pdf2image can be combined with pytesseract.
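The pdftotext route above can be sketched as follows, assuming the pdftotext command-line tool (from Poppler) is installed and on PATH. The function names pdf_to_layout_text and split_columns are illustrative, and splitting on runs of two or more spaces is a heuristic that will not suit every layout.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext with -layout and return its output as a string.

    Assumes the pdftotext CLI is available; "-" sends the text to stdout.
    """
    completed = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return completed.stdout


def split_columns(layout_text, min_gap=2):
    """Split each non-empty line into cells on runs of >= min_gap spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in layout_text.splitlines():
        stripped = line.strip()
        if stripped:
            rows.append(gap.split(stripped))
    return rows


if __name__ == "__main__":
    # Made-up sample standing in for pdftotext output.
    sample = "Name        Qty\nWidget A      3\nGadget       12\n"
    for row in split_columns(sample):
        print(row)
```

Cells that contain a single space (such as "Widget A") stay intact, while wider gaps are treated as column boundaries. Tables with empty cells or ragged alignment usually need a dedicated tool such as camelot or tabula instead.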

Related Questions:

177 questions

1 answer

How to get image from local directory on a pdf created with ITextRenderer?

I'm parsing pdf from html with ITextRenderer as follows: private void createPdf(File file, String content) throws IOException, DocumentException { OutputStream os = new FileOutputStream(file); content = tidyUpHTML(content); …

asked Dec 08 '16 at 14:03

Steve Waters

0 answers

Extract a section from PDF-File

My goal is to extract the Abstract of a PDF file. Is it possible to extract the text after a keyword (Abstract) and/or search for a specific font style and extract a section of a document? Currently, I'm using PDFBox to extract the text but…

java pdfbox pdf-parsing

asked Nov 19 '16 at 21:20

Clemens

1 answer

Can we able to Split PDF files using Pig Udfs?

I have 100 PDFs, each with 40 pages, and they are not being processed. We are trying to use a Pig UDF. Can we split PDF files using Pig UDFs?

apache-pig pdf-parsing pig-udf

asked May 03 '16 at 07:45

Manohar Reddy

0 answers

How to parse line of PDF file from PHP?

I want to parse a PDF file from PHP. For this, I have built this code (I have used the PDF Parser library). Code:

php pdf pdf-parsing

asked Feb 11 '16 at 05:15

bircastri

1 answer

PdfReaderContentParser.ProcessContent returns whitespace for clear text

I'd like to parse a pdf for texts containing both, binary and clear text data. When I try to do it with PdfReaderContentParser the GetResultantText method returns the right texts for the binary content but whitespaces for the clear text content.…

itext pdf-parsing

asked Nov 30 '15 at 08:58

seeb

1 answer

Python: parse pdf with images

I want to parse some PDF files that contain text and may or may not contain images. I want to extract the text portion as a string for further processing and save the images as JPEG/PNG or any other image format. What should be the best module to work…

python pdf-parsing

asked Sep 20 '15 at 20:32

Kamrul Khan

0 answers

Itext - Retrieving the image width in inches incorrectly

I am using the below function public void renderImage(ImageRenderInfo renderInfo) { try { String filename; FileOutputStream os; PdfImageObject image = renderInfo.getImage(); PdfDictionary imageDict =…

java pdf itext pdf-parsing

asked Aug 24 '15 at 12:01

Abhinav

1 answer

How to convert the PDF content code to the type like "(<0034>) Tj"?

PDF content is stored in several ways: "(abc) Tj", "(<0035><0035>) Tj" or "\u065". I want to know if there is a way to convert the PDF code to one type, whether it is direct text "(abc) Tj", hexadecimal "(<0035><0035>) Tj", or octal "\u065". I think…

pdf pdf-generation ghostscript pdf-conversion pdf-parsing

asked Aug 22 '15 at 00:45

SuperBerry

1 answer

Error while retrieving images from pdf using Itext

I have an existing PDF from which I want to retrieve images. NOTE: In the documentation, this is the RESULT variable: public static final String RESULT = "results/part4/chapter15/Img%s.%s"; I do not understand why this image is needed. I just want to…

java pdf itext pdf-parsing

asked Aug 12 '15 at 10:22

Abhinav

1 answer

How to NSLog a bytebuffer ( NSData / const char* ) that includes zeros in the buffer stream?

I want to NSLog the content of a PDF that has compressed stream objects which include zeros ('0') in the middle of the stream. Unfortunately the first occurrence of '0' in the first stream object terminates the output on the console... Couldn't find…

objective-c pdf nslog bytebuffer pdf-parsing

asked Jul 21 '15 at 17:28

mramosch

2 answers

How open and read pdf (originally .html) file using Python3

I need to open this file in Python 3: http://www.arch.gob.ec/index.php/descargas/doc_download/478-historial-de-produccion-nacional-de-crudo-2011.html I then have to read it and extract the data tables. I have searched for several hours but…

python pdf python-3.x web-scraping pdf-parsing

asked Jul 08 '15 at 12:46

Mathias Lia Carlsen

0 answers

How to get text from pdf preserving original formatting (with CTX_DOC)?

I use this code to filter text from pdf file: create or replace directory pdf_dir as '&1'; create or replace directory l_curr_dir as '&3'; declare ll_clob CLOB; l_bfile BFILE; l_filename VARCHAR2(200) := '&2'; begin begin …

oracle plsql pdf-parsing bfile

asked Jun 21 '15 at 08:31

pradeep

1 answer

Pdf processing and manipulation online

I'd like to show a PDF file online and provide translations when words are clicked in the PDF. The PDF comes from the user and doesn't have any markup from me. If a translated PDF is available, I'd like to show fragments of the translation PDF when…

php pdf pdf-generation pdfbox pdf-parsing

asked Apr 20 '15 at 10:11

jeff

1 answer

How to check if a checkbox is checked or not on a non-form PDF using C#?

Using C#, I want to see if a specific check box is checked on a PDF page. The PDF file is not a form. The PDF could be something like: Sample file is here: MDS30ResidentP2.pdf (in this sample file, I want to somehow figure out that check box "E"…

c# pdf itext pdf-parsing

asked Aug 08 '14 at 19:11

Tohid

1 answer

Parsing a PDF file using IText to add hyper link in existing texts

I know that PDFs are not meant for editing, but I have a requirement where I need to parse a PDF and modify it to convert all text elements to hyperlinks. Is there a way to achieve this? Many thanks,

java itext pdfbox pdf-parsing

asked Jul 21 '14 at 06:50

Mukesh Kumar
