Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is to say, extracting text, images or other data from them, or converting them to simpler formats (such as plain text).

Because of the complexity of the PDF format (cf. the specification ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from all documents) and exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.
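
To illustrate what "raw data" means here, the following is a minimal sketch in Python that inflates Flate-compressed streams found in the bytes of a PDF file. It is not how pdf-parser itself works; the function name `iter_flate_streams` is hypothetical, and the approach is deliberately naive: it scans for the `stream` and `endstream` keywords instead of following the cross-reference table, and it ignores encryption, predictors and every filter other than Flate.

```python
import re
import zlib

# "stream" followed by an end-of-line marker, but not the tail of "endstream".
_STREAM_START = re.compile(rb"(?<!end)stream\r?\n")


def iter_flate_streams(pdf_bytes):
    """Yield the decompressed bytes of every stream that zlib can inflate."""
    for match in _STREAM_START.finditer(pdf_bytes):
        start = match.end()
        end = pdf_bytes.find(b"endstream", start)
        if end == -1:
            break  # truncated file: no closing keyword
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            data = zlib.decompressobj().decompress(pdf_bytes[start:end])
        except zlib.error:
            continue  # not Flate-compressed, or encoded with another filter
        yield data
```

A real parser would read each stream's /Length and /Filter entries instead of guessing, so treat this only as a way to peek inside simple, unencrypted files.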

Python-related options:

You may extract tables directly using Camelot ("PDF Table Extraction for Humans").
You may process the PDF directly using tabula.
You may convert the PDF to text using pdftotext, then parse the text with Python.
You may use an external tool to convert the PDF file to Excel or CSV, then open the Excel/CSV file with the appropriate Python module.
You may also convert the PDF to an image file, then use OCR software (some of which can reconstruct tables automatically from the picture) to get the data, for example pdf2image with pytesseract.
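
The pdftotext route can be sketched as follows. This is a minimal example, assuming the `pdftotext` command-line tool (from Poppler or Xpdf) is installed and on the PATH; the helper names `pdf_to_text` and `split_columns` are hypothetical, and splitting on runs of two or more spaces is only a heuristic that suits simple tables produced with the `-layout` option.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return its output as a string."""
    result = subprocess.run(
        # "-layout" keeps the physical layout; "-" sends the text to stdout.
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return result.stdout


def split_columns(text):
    """Split each non-blank line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows
```

Typical use would be `rows = split_columns(pdf_to_text("report.pdf"))`, where `report.pdf` is a placeholder path. Cells that contain double spaces, or columns separated by a single space, will be split incorrectly, which is where the dedicated table extractors listed above tend to do better.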

Related Questions:

177 questions

1 answer

pdf content stream parsing

I need help parsing a PDF. The PDF was built in Illustrator and has 4 layers, and each layer has one graphic path object. What I want to do is get all 4 graphic paths and draw them in another PDF file that has the same width and height as…

c# pdf-generation pdfsharp pdf-parsing

asked Aug 05 '11 at 01:05

ygaradon

1 answer

Ensure loop runs through every file even when errors are raised

I am iterating over a bunch of PDFs in a folder, parsing their content and appending it to a list. It works on a subset of the PDF files. I don't want to manually remove some of the PDFs, run the code and then add a few to run it again until I found the…

python exception error-handling pdfminer pdf-parsing

asked Sep 02 '21 at 15:38

id345678

1 answer

Python - Google Cloud Document AI API- Not reading the whole .pdf file

I am trying to read a PDF stored in GCS in Python using the Google Document AI API and return the text from the PDF as a string. I do not want the parser to read tables and images, as I am only interested in text. Below is the code I am using to parse the…

google-api-python-client pdf-parsing cloud-document-ai

asked May 10 '21 at 14:58

Jayashree Sridhar

1 answer

Do PDF name objects require capitalization?

Page 17 of the PDF 1.7 spec indicates that /lime#20Green should produce Lime Green. Is this an errata? I see nothing in the spec about capitalizing the first character of a name, and the example just below (paired#28#29parenthesis) does not correct…

pdf pdf-parsing

asked May 09 '21 at 10:26

murty

0 answers

python java Tika urllib.error.URLError:

I saw this thread Python Tika error: URLError:

python-3.x runtime-error urllib apache-tika pdf-parsing

asked Mar 13 '21 at 04:14

rickuls

2 answers

'NoneType' object is not iterable when trying to extract from PDF

I am trying to extract data from a PDF, but I keep getting a type error because my object is not iterable (on the statement for line in text:), but I don't understand why 'text' has no value; just above that I create the text object using text =…

python nonetype pdf-parsing pdf-extraction

asked Jan 10 '21 at 01:27

Don Carroll

0 answers

Java PDFBox: read text from a PDF in Hindi (non-English PDF)

I am using Java PDFBox to read text from a PDF. It works fine for PDFs in English, but I want to read data from a PDF in a language other than English. The language in the PDF is Hindi (from India). The data I get in this case looks like encoded strings. How can I…

java pdfbox pdf-parsing

asked Nov 10 '20 at 16:42

hrishi

0 answers

What is the best way to extract the body of an article with Python?

Summary: I am building a text summarizer in Python. The kind of documents that I am mainly targeting are scholarly papers that are usually in PDF format. What I want to achieve: I want to effectively extract the body of the paper (abstract to…

python nlp pdf-parsing pdfparser

asked Aug 17 '20 at 17:13

mdave1701

1 answer

Is there a way to pass credentials programmatically for using Google documentAI without reading from a disk?

I am trying to run the demo code given in PDF parsing of GCP document AI. To run the code, exporting Google credentials as a command line works fine. The problem comes when the code needs to run in memory and hence no credential files are allowed to…

python google-cloud-platform google-cloud-functions pdf-parsing cloud-document-ai

asked Jul 02 '20 at 14:42

sentinel

0 answers

How to retrieve particular table data in multiple tables from a PDF using python

I have 100 annual reports of different banks. All these annual reports are in the same format. I want to extract the profit & loss table and the balance sheet table from all 100 PDFs and store them in an Excel file. Is there any way to do that using Python? Below…

python excel tabular tabula pdf-parsing

asked Dec 23 '19 at 05:04

bibbi

2 answers

java.net.URL class throwing MalformedException because of unknown protocol: blob

I'm automating my test scenario for validation of a PDF document. This document opens in a new browser tab when the document link (anchor tag) is clicked. I want to validate a few important contents in the document, for which I'm using Apache PDFBox.…

java selenium url automation pdf-parsing

asked Dec 12 '19 at 23:00

Shantanu

0 answers

Need help parsing a PDF file in a structured way using Java

I am unable to parse a PDF document as (key, value) pairs. Can anyone please help me parse a PDF file in a structured manner? I was able to extract text from the PDF file using the Java code below. org.apache.pdfbox.pdmodel.PDDocument doc=null; …

java pdfbox key-value pdf-parsing

asked Sep 01 '19 at 07:44

Avinash

0 answers

How to populate values to input elements present in existing PDF using Ruby?

I have a PDF with various input elements, check boxes and radio buttons. How can I populate values into those elements using Ruby?

ruby pdf ruby-on-rails-5 pdf-parsing

asked Jun 20 '19 at 05:30

manohar vadapalli

0 answers

Output all hyperlinks in a pdf with C#

A good friend of mine is currently writing a book for his PhD, and asked me if I could help him automate the process of checking all his cited sources (hyperlinks). I've searched all over the internet and could not find any helpful tip on how to…

c# pdf pdf-parsing

asked May 02 '19 at 14:02

beadrex

1 answer

Unable to convert PDF to text format

I am getting this error while parsing a PDF file using PyPDF2. I have attached the PDF to be parsed along with the error. Can anyone help? import PyPDF2 def convert(data): pdfName = data read_pdf…

python python-3.x python-2.7 pdf-parsing

asked Apr 13 '19 at 19:20

Chandan Chanda
