Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is to say, extracting text, images or other data from them, or converting them to simpler formats (such as plain text).

Because the PDF format is complex (cf. the specification, ISO 32000-1), parsers are often incomplete (they cannot extract all information from all documents) and can be exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.
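To make "extracting raw data" concrete, here is a minimal sketch in Python of pulling Flate-compressed streams out of a PDF's bytes. It is not how pdf-parser is implemented; `inflate_streams` and the sample bytes are hypothetical names made up for illustration, and the sketch ignores `/Length`, other filters, encryption and object streams.

```python
import re
import zlib

# Naive pattern for "stream ... endstream" sections (assumption: no /Length handling).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list[bytes]:
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate-compressed (or damaged); skip it
    return results


# Hypothetical in-memory object, shaped like a PDF stream object.
payload = b"BT (Hello) Tj ET"
fake_object = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(payload)
    + b"\nendstream\nendobj\n"
)
streams = inflate_streams(fake_object)
```

On a real file you would pass `open(path, "rb").read()` instead of the sample bytes; a dedicated tool is still the safer choice for untrusted documents.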

Python Related Options:

You may extract the table directly using Camelot ("PDF Table Extraction for Humans").
You may process the PDF directly using tabula.
You may convert the PDF to text using pdftotext, then parse the text with Python.
You may use an external tool to convert your PDF file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file.
You may also convert the PDF to an image file (for example with pdf2image), then run OCR on it (for example with pytesseract) to get the data; some OCR software can reconstruct tables from the picture.
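
As a rough sketch of the "pdftotext, then parse with Python" option: the function below shells out to poppler's `pdftotext` with `-layout` (which keeps columns visually aligned) and a second function splits each line into cells on runs of two or more spaces. The names `pdf_to_text` and `split_columns` and the sample text are made up for illustration, and the two-space rule is an assumption that will not hold for every document.

```python
import re
import subprocess


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext (must be installed and on PATH) and return its output."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(text: str) -> list[list[str]]:
    """Split each non-empty line into cells on runs of 2+ whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Hypothetical text, shaped like "pdftotext -layout" output for a small table.
sample = "Name      Qty   Price\nWidget A    2    3.50\n\nGadget      10   12.00\n"
rows = split_columns(sample)
```

With a real file, `split_columns(pdf_to_text("report.pdf"))` would be the equivalent call; cells that contain wide gaps, or columns separated by a single space, need a different rule.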

Related Questions:

177 questions

votes

1 answer

PDFplumber password and check_extractable

I am using the pdfplumber library to parse PDFs. The way to access a PDF file is "pdfplumber.open(path)". Can someone please help me pass the password and check_extractable parameters to it?

asked Feb 22 '19 at 10:45

Nikhil Bhawsinka

votes

1 answer

Ghostscript txtwrite bbox limits

When I use Ghostscript with the txtwrite device, I'm getting an XML file that describes my pdf, i.e …

pdf ghostscript bounding-box pdf-parsing

asked Jan 23 '19 at 07:42

Mugen

votes

1 answer

No tables found and merged column text when extracting data from this PDF using Camelot

I get a UserWarning: No tables found on page-1 when I try to extract tables from the attached PDF. However, when I looked at the extracted data, some of the column text was merged into a single column. I am using Camelot to parse these PDFs. Steps…

python pdf-parsing python-camelot

asked Nov 09 '18 at 18:39

Arpit Solanki

votes

1 answer

How to use page.filter(test_function) in PDFPlumber library?

I am trying to delete tables inside a PDF page, and I'm trying to use the page.filter() function for that. I have the table bbox coordinates, and I am trying to check whether object coordinates are inside the table coordinates or not. But I was unable…

python pdf pdf-parsing pdf-extraction

asked Nov 03 '18 at 08:30

Satyaaditya

votes

0 answers

How to get font info of a selected text in pdf using pdfbox

I have the coordinates of the selected text in the pdf. And I am using PDFTextStripperByArea to add and extract the region to get the text info. But I want to get the font info of that selected text. When I use getResources() method of…

java pdf pdfbox text-extraction pdf-parsing

asked May 01 '18 at 04:54

swarupn

votes

1 answer

Python- PDFTables parsing ignoring spaces between columns

I am trying to parse pdf tables by using pdftables python library. But it is combining columns and ignoring spaces. Here is my code: pdf_page = get_pdf_page(fileobj, page) tables = page_to_tables(pdf_page) Structure of tables in pdf…

python parsing pdf pdf-parsing

asked Apr 03 '18 at 05:46

Khushhal

votes

0 answers

How to parse text extracted from PDF file with delimiter using Python?

I have tried PyPDF2 to extract and parse text from PDF using following code segment; import PyPDF2 import re pdfFileObj = open('test.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj) rawText = pdfReader.getPage().extractText() extractedText…

python parsing pdf pdf-parsing pypdf

asked Sep 24 '17 at 10:51

Nawshad Rehan Rasha

votes

1 answer

Error in Text format while parsing PDF using Smalot PDF parser

I'm trying to parse a PDF using Smalot PDF Parser, but the problem is that the text is not formatted well. It shows spaces between the letters of words. For example: the word "Letter" is written as "L e tt e r". How can I correct it? Moreover, the…

php pdf tcpdf pdf-parsing

asked Sep 20 '17 at 06:54

Saqib Javed

votes

1 answer

Can only use wrapper function a single time after definition then getting NameError

Background I'm using pdfquery to scrape data from PDFs. Like this one. This question builds off my earlier question here. I have successfully been able to use custom wrapper functions that can take arguments as seen in this answer. Except for the…

python python-3.x jupyter-notebook wrapper pdf-parsing

asked Aug 29 '17 at 18:32

James Draper

votes

2 answers

Using functools.partial to make custom filters for pdfquery getting attribute error

Background I'm using pdfquery to parse multiple files like this one. Problem I'm trying to write a generalized filter function, building off of the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because…

python python-3.x pdf functools pdf-parsing

asked Aug 24 '17 at 18:40

James Draper

votes

1 answer

Read data from a PDF document that does not have an XFA-form

I use iText to read a PDF document containing an XFA form. I convert it to XML, read data from the XML and insert it in a database. But if I don't have an XFA form in the PDF, how can I efficiently read data from the PDF?

pdf itext pdf-parsing

asked Aug 09 '17 at 08:53

hrishi

votes

1 answer

How is it possible to extract bookmarks from a PDF File in PHP using Smalot/PDFParser?

Right now I'm working with PHP and Laravel. My objective is to extract the most information possible out of an uploaded PDF file (using a Form and POST method) such as metadata (author, title, etc.), first page (cover), content of each page and the…

php laravel parsing pdf pdf-parsing

asked Aug 04 '17 at 10:30

Henrique Ferreira

votes

0 answers

How to parse and regenerate PDF using php

I want to edit part of a PDF and regenerate it in the same format after editing. I have tried pdftk, but it won't allow editing the read-only labels. I succeeded in parsing the PDF using Smalot PDF Parser, but now I don't know how to again…

php pdftk pdf-parsing

asked Feb 13 '17 at 06:00

prakash tank

votes

0 answers

Coding a PDF Text Parser in swift

I'm currently developing a PDF text parser completely in Swift. I was looking through the PDFKitten code and found this in the stringwithpdfstring method (in SimpleFont.m) taking a CGPDFStringRef as parameter. const unsigned char *bytes =…

ios swift pdf pdf-parsing cgpdf

asked Jan 17 '17 at 15:34

Michael Schmid

votes

1 answer

'Smalot PDF Parser' result: text not on the same line

So I installed PDF Parser (http://www.pdfparser.org/). I checked their website and used the demo. This gave me the result I wanted. After hours of searching for how to use Composer, I finally managed to get it working. Now I'm stuck with the next…

php pdf pdf-parsing

asked Jan 09 '17 at 15:01

PHPeter
