Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag covers parsing such documents: extracting text, images, or other data from them, or converting them to simpler formats such as plain text.

Because the PDF format is complex (cf. the specification, ISO 32000-1), parsers are often incomplete (they cannot extract all information from every document) and are subject to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python-related options:

  • Extract tables directly with Camelot ("PDF Table Extraction for Humans").
  • Process the PDF directly with tabula.
  • Convert the PDF to text with pdftotext, then parse the text with Python.
  • Use an external tool to convert the PDF to Excel or CSV, then open the result with a suitable Python module.
  • Convert the PDF to an image file, then use recent OCR software (which can reconstruct tables automatically from the picture) to get the data.
  • For the OCR route, combine pdf2image with pytesseract.
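
As a minimal sketch of the pdftotext option above: the code below shells out to Poppler's pdftotext with the -layout flag and then splits each line into cells wherever two or more spaces occur. It assumes the pdftotext executable is installed and on PATH; the function names pdf_to_text and split_rows and the two-space column heuristic are illustrative assumptions, not part of any library, and the heuristic will not suit every table.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run Poppler's pdftotext and return the extracted text.

    Assumes the pdftotext executable is installed and on PATH.
    '-layout' keeps the physical layout; '-' writes to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_rows(text, min_gap=2):
    """Split each non-empty line into cells on runs of >= min_gap spaces.

    This is a heuristic (an assumption of this sketch): cells that
    contain wide internal gaps will be split incorrectly.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows
```

A typical call would be rows = split_rows(pdf_to_text("report.pdf")), where "report.pdf" is a hypothetical file name. For tables with merged cells or ragged columns, the Camelot or tabula options listed above are likely a better fit than this heuristic.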

Related Questions:

177 questions
0 votes, 1 answer

How to get image from local directory on a pdf created with ITextRenderer?

I'm parsing pdf from html with ITextRenderer as follows: private void createPdf(File file, String content) throws IOException, DocumentException { OutputStream os = new FileOutputStream(file); content = tidyUpHTML(content); …
Steve Waters
0 votes, 0 answers

Extract a section from PDF-File

My goal is to extract the abstract of a PDF file. Is it possible to extract the text after a keyword (Abstract) and/or search for a specific font style and extract a section of a document? Currently, I'm using PDFBox to extract the text but…
Clemens
0 votes, 1 answer

Can we split PDF files using Pig UDFs?

I have 100 PDFs, each with 40 pages, and they are not being processed. We are trying to use a Pig UDF. Can we split PDF files using a Pig UDF?
0 votes, 0 answers

How to parse a line of a PDF file in PHP?

I want to parse a PDF file in PHP. For this, I have built this code (I have used the PDF Parser library). Code:
bircastri
0 votes, 1 answer

PdfReaderContentParser.ProcessContent returns whitespace for clear text

I'd like to parse a PDF for text that contains both binary and clear-text data. When I try to do it with PdfReaderContentParser, the GetResultantText method returns the right text for the binary content but whitespace for the clear-text content.…
seeb
0 votes, 1 answer

Python: parse pdf with images

I want to parse some PDF files that contain text and may or may not contain images. I want to extract the text portion as a string for further processing and save the images as JPEG/PNG or any other image format. What would be the best module to work…
Kamrul Khan
0 votes, 0 answers

Itext - Retrieving the image width in inches incorrectly

I am using the below function public void renderImage(ImageRenderInfo renderInfo) { try { String filename; FileOutputStream os; PdfImageObject image = renderInfo.getImage(); PdfDictionary imageDict =…
Abhinav
0 votes, 1 answer

How to convert PDF content code to a form like "(<0034>) Tj"?

PDF content can be stored in several ways: "(abc) Tj", "(<0035><0035>) Tj" or "\u065". I want to know if there is a way to convert the PDF code to a single form, whether it is direct text "(abc) Tj", hexadecimal "(<0035><0035>) Tj", or octal "\u065". I think…
SuperBerry
0 votes, 1 answer

Error while retrieving images from pdf using Itext

I have an existing PDF from which I want to retrieve images. NOTE: In the documentation, this is the RESULT variable: public static final String RESULT = "results/part4/chapter15/Img%s.%s"; I don't understand why this image is needed. I just want to…
Abhinav
0 votes, 1 answer

How to NSLog a bytebuffer ( NSData / const char* ) that includes zeros in the buffer stream?

I want to NSLog the content of a PDF that has compressed stream objects which include zeros ('0') in the middle of the stream. Unfortunately the first occurrence of '0' in the first stream object terminates the output on the console... Couldn't find…
mramosch
0 votes, 2 answers

How to open and read a PDF (originally .html) file using Python 3

I need to open this file in Python 3: http://www.arch.gob.ec/index.php/descargas/doc_download/478-historial-de-produccion-nacional-de-crudo-2011.html I then have to read it and extract the data tables. I have searched for several hours but…
0 votes, 0 answers

How to get text from pdf preserving original formatting (with CTX_DOC)?

I use this code to filter text from pdf file: create or replace directory pdf_dir as '&1'; create or replace directory l_curr_dir as '&3'; declare ll_clob CLOB; l_bfile BFILE; l_filename VARCHAR2(200) := '&2'; begin begin …
pradeep
0 votes, 1 answer

PDF processing and manipulation online

I'd like to show a PDF file online and provide translations when words in the PDF are clicked. The PDF comes from the user and doesn't have any markup from me. If a translated PDF is available, I'd like to show fragments of the translated PDF when…
jeff
0 votes, 1 answer

How to check if a checkbox is checked or not on a non-form PDF using C#?

Using C#, I want to see whether a specific check box is checked on a PDF page. The PDF file is not a form. The PDF could be something like: Sample file is here: MDS30ResidentP2.pdf (in this sample file, I want to somehow figure out that check-box "E"…
Tohid
0 votes, 1 answer

Parsing a PDF file using iText to add hyperlinks to existing text

I know that PDFs are not meant for editing, but I have a requirement to parse a PDF and modify it so that all text elements become hyperlinks. Is there a way to achieve this? Many thanks,
Mukesh Kumar