Questions tagged [pdf-parsing]

Covers extracting useful information from PDF content (for example, text or images).

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because of the complexity of the PDF format (cf. the specification ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from every document) and are exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.
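
To give a sense of what extracting raw data involves, here is a minimal Python sketch that tries to inflate the Flate-compressed streams found in a PDF's bytes. It is an illustration only, not how pdf-parser works: the function name inflate_streams and the sample bytes are made up for this example, and the naive regular expression ignores /Length, object streams, encryption, and every filter other than Flate, so it will miss or mangle data in many real documents.

```python
import re
import zlib

# Naive assumption: a stream body sits between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"\bstream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated body of every stream that is valid zlib data."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        # decompressobj tolerates the end-of-line bytes left after the body.
        decompressor = zlib.decompressobj()
        try:
            data = decompressor.decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data, e.g. a JPEG image or an uncompressed stream
        if decompressor.eof:  # keep only streams that inflated completely
            inflated.append(data)
    return inflated


# Hypothetical, hand-built object that mimics a compressed content stream.
payload = zlib.compress(b"BT (Hello) Tj ET")
sample = (
    b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(payload)
    + payload
    + b"\nendstream\nendobj\n"
)

print(inflate_streams(sample))  # [b'BT (Hello) Tj ET']
```

Streams that fail to inflate are skipped rather than reported, which keeps the sketch short; a real tool would read each stream's dictionary to decide which filter to apply.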

Python-related options:

  • You may extract tables directly using camelot ("PDF Table Extraction for Humans").
  • You may process the PDF directly using tabula.
  • You may convert the PDF to text using pdftotext, then parse the text with Python.
  • You may use an external tool to convert your PDF file to Excel or CSV, then use a suitable Python module to open the Excel/CSV file.
  • You may also convert the PDF to an image file, then use recent OCR software (which can reconstruct tables automatically from the picture) to get the data.
  • For the OCR route, pdf2image can be combined with pytesseract.
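
The pdftotext route can be sketched as follows. This assumes the pdftotext command-line tool (from Poppler or Xpdf) is installed and on PATH; the helper names pdf_to_text and split_columns, the sample text, and the rule of splitting columns on runs of two or more spaces are assumptions for illustration, and real layouts often need more careful handling.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run `pdftotext -layout` and return its output (needs pdftotext on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


# Hypothetical text, shaped like what `pdftotext -layout` prints for a small table.
sample_text = """\
Item            Qty     Price
Widget A          2      9.50
Widget B         10     12.00
"""

for row in split_columns(sample_text):
    print(row)
# ['Item', 'Qty', 'Price']
# ['Widget A', '2', '9.50']
# ['Widget B', '10', '12.00']
```

With a real file, the same parsing would be applied to the text returned by pdf_to_text for that file's path.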

Related Questions:

177 questions
0 votes, 1 answer

PDF content stream parsing

I need help parsing a PDF. The PDF was built in Illustrator and has 4 layers, and each layer has one graphic path object. What I want to do is get all 4 graphic paths and draw them in another PDF file that has the same width and height as…
ygaradon
0 votes, 1 answer

Ensure loop runs through every file even when errors are raised

I am iterating over a bunch of PDFs in a folder, parsing their content and appending it to a list. It works on a subset of the PDF files. I don't want to manually remove some of the PDFs, run the code, and then add a few back and run it again until I find the…
id345678
0 votes, 1 answer

Python - Google Cloud Document AI API - Not reading the whole .pdf file

I am trying to read a PDF stored in GCS in Python using the Google Document AI API and return the text from the PDF as a string. I do not want the parser to read tables and images, as I am only interested in text. Below is the code I am using to parse the…
0 votes, 1 answer

Do PDF name objects require capitalization?

Page 17 of the PDF 1.7 spec indicates that /lime#20Green should produce Lime Green. Is this an erratum? I see nothing in the spec about capitalizing the first character of a name, and the example just below (paired#28#29parenthesis) does not correct…
murty
0 votes, 0 answers

python java Tika urllib.error.URLError:

I saw this thread Python Tika error: URLError:
0 votes, 2 answers

'NoneType object is not iterable' when trying to extract from PDF

I am trying to extract data from a PDF, but I keep getting a type error because my object is not iterable (on the statement for line in text:), but I don't understand why 'text' has no value; just above that I create the text object using text =…
0 votes, 0 answers

Java PDFBox: read text from a PDF in Hindi (non-English PDF)

I am using Java PDFBox to read text from a PDF. It works fine for PDFs in English, but I want to read data from a PDF in a language other than English. The language in the PDF is Hindi (from India). The data I get in this case looks like encoded strings. How can I…
hrishi
0 votes, 0 answers

What is the best way to extract the body of an article with Python?

Summary: I am building a text summarizer in Python. The documents I am mainly targeting are scholarly papers, which are usually in PDF format. What I want to achieve: I want to effectively extract the body of the paper (abstract to…
mdave1701
0 votes, 1 answer

Is there a way to pass credentials programmatically for using Google Document AI without reading from a disk?

I am trying to run the demo code given in PDF parsing of GCP Document AI. To run the code, exporting the Google credentials on the command line works fine. The problem comes when the code needs to run in memory and hence no credential files are allowed to…
0 votes, 0 answers

How to retrieve particular table data from multiple tables in a PDF using Python

I have 100 annual reports of different banks. All these annual reports are in the same format. I want to extract the profit & loss table and the balance sheet table from all 100 PDFs and store them in an Excel file. Is there any way to do that using Python? Below…
bibbi
0 votes, 2 answers

java.net.URL class throwing MalformedURLException because of unknown protocol: blob

I'm automating my test scenario for validation of a PDF document. This document opens in a new browser tab when the document link (anchor tag) is clicked. I want to validate a few important contents of the document, for which I'm using Apache PDFBox.…
Shantanu
0 votes, 0 answers

Need help parsing a PDF file in a structured way using Java

I am unable to parse a PDF document as (key, value) pairs. Can anyone please help me parse a PDF file in a structured manner? I was able to extract text from the PDF file using the Java code below. org.apache.pdfbox.pdmodel.PDDocument doc=null; …
Avinash
0 votes, 0 answers

How to populate values into input elements in an existing PDF using Ruby?

I have a PDF with various input elements, check boxes, and radio buttons. How can I populate values into those elements using Ruby?
0 votes, 0 answers

Output all hyperlinks in a pdf with C#

A good friend of mine is currently writing a book for his PhD and asked me if I could help him automate the process of checking all his cited sources (hyperlinks). I've searched all over the internet and could not find any helpful tip on how to…
beadrex
0 votes, 1 answer

Unable To Convert PDF to Text format

I am getting this error while parsing a PDF file using PyPDF2. I am attaching the PDF to be parsed along with the error. Can anyone help? import PyPDF2 def convert(data): pdfName = data read_pdf…
Chandan Chanda