Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is to say, extracting text, images or other data from them, or converting them to simpler formats (such as plain text).
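As a small illustration of what "parsing" starts from, the sketch below reads the version out of a PDF header using only the standard library. It assumes the `%PDF-x.y` header sits at the very first byte of the data; the function name `sniff_pdf_version` and the sample bytes are made up for this example, and a real parser has to do far more than this.

```python
import re

def sniff_pdf_version(data: bytes):
    """Return the version string from a PDF header, or None if absent."""
    # A PDF file begins with a header comment such as b"%PDF-1.7".
    # Assumption: the header is at offset 0 of the data.
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None

# Hypothetical sample bytes standing in for the start of a real file.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(sniff_pdf_version(sample))
print(sniff_pdf_version(b"plain text"))
```

For a file on disk, the same function could be given the first few bytes read in binary mode, for example `open(path, "rb").read(16)`.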

Because of the complexity of the PDF format (cf. the specification ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from all documents) and exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python Related Options:

  • You may extract the table directly using camelot ("PDF Table Extraction for Humans")
  • You may process the PDF directly using tabula
  • You may convert the PDF to text using pdftotext, then parse the text with Python
  • You may use an external tool to convert your PDF file to Excel or CSV, then open the Excel/CSV file with a suitable Python module
  • You may also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data
  • You may use pdf2image with pytesseract
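The pdftotext route from the list above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes Poppler's `pdftotext` is installed and on the PATH, that the table columns come out separated by at least two spaces in `-layout` mode, and that no cell contains a double space. The helper names `pdf_to_layout_text` and `split_columns` and the sample text are hypothetical.

```python
import re
import subprocess

def pdf_to_layout_text(pdf_path: str) -> str:
    """Run pdftotext (Poppler) and return text with the page layout preserved."""
    # "-layout" keeps columns aligned; "-" sends the output to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def split_columns(layout_text: str) -> list:
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in layout_text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows

# Hypothetical sample standing in for pdftotext output, so no PDF is needed here.
sample = "Item        Qty    Price\nWidget A      2     9.50\n\nGadget       10    12.00\n"
for row in split_columns(sample):
    print(row)
```

With a real file, `split_columns(pdf_to_layout_text("report.pdf"))` would be the intended call, where `report.pdf` is a placeholder path. Splitting on whitespace runs is fragile for ragged or multi-line cells, which is where dedicated table extractors such as camelot or tabula tend to do better.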

Related Questions:

177 questions
0 votes, 1 answer

PDFplumber password and check_extractable

I am using the pdfplumber library to parse PDFs. The way to open a PDF file is "pdfplumber.open(path)". Can someone please tell me how to pass the password and check_extractable parameters to it?
0 votes, 1 answer

Ghostscript txtwrite bbox limits

When I use Ghostscript with the txtwrite device, I get an XML file that describes my PDF, i.e.
Mugen
  • 8,301
  • 10
  • 62
  • 140
0 votes, 1 answer

No tables found and merged column text when extracting data from this PDF using Camelot

I get a UserWarning: No tables found on page-1 when I try to extract tables from the attached PDF. However, when I looked at the extracted data, some of the column text was merged into a single column. I am using Camelot to parse these PDFs. Steps…
Arpit Solanki
  • 9,567
  • 3
  • 41
  • 57
0 votes, 1 answer

How to use page.filter(test_function) in PDFPlumber library?

I am trying to delete tables inside a PDF page, and I'm trying to use the page.filter() function for that. I have the table bbox coordinates, and I am trying to check whether an object's coordinates are inside the table coordinates or not. But I was unable…
Satyaaditya
  • 537
  • 8
  • 26
0 votes, 0 answers

How to get font info of a selected text in pdf using pdfbox

I have the coordinates of the selected text in the PDF, and I am using PDFTextStripperByArea to add and extract the region to get the text info. But I want to get the font info of that selected text. When I use the getResources() method of…
swarupn
  • 21
  • 1
0 votes, 1 answer

Python- PDFTables parsing ignoring spaces between columns

I am trying to parse PDF tables using the pdftables Python library, but it is combining columns and ignoring spaces. Here is my code:
    pdf_page = get_pdf_page(fileobj, page)
    tables = page_to_tables(pdf_page)
Structure of tables in pdf…
Khushhal
  • 91
  • 1
  • 8
0 votes, 0 answers

How to parse text extracted from PDF file with delimiter using Python?

I have tried using PyPDF2 to extract and parse text from a PDF with the following code segment:
    import PyPDF2
    import re
    pdfFileObj = open('test.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    rawText = pdfReader.getPage().extractText()
    extractedText…
0 votes, 1 answer

Error in Text format while parsing PDF using Smalot PDF parser

I'm trying to parse a PDF using Smalot PDF Parser, but the problem is that the text is not formatted well. It shows spaces between the letters of words. For example, the word "Letter" is written as "L e tt e r". How can I correct it? Moreover, the…
0 votes, 1 answer

Can only use wrapper function a single time after definition then getting NameError

Background: I'm using pdfquery to scrape data from PDFs. Like this one. This question builds off my earlier question here. I have successfully been able to use custom wrapper functions that can take arguments, as seen in this answer. Except for the…
James Draper
  • 5,110
  • 5
  • 40
  • 59
0 votes, 2 answers

Using functools.partial to make custom filters for pdfquery getting attribute error

Background: I'm using pdfquery to parse multiple files like this one. Problem: I'm trying to write a generalized filter function, building off of the custom selectors mentioned in pdfquery's docs, that can take a specific range as an argument. Because…
James Draper
  • 5,110
  • 5
  • 40
  • 59
0 votes, 1 answer

Read data from a PDF document that does not have an XFA-form

I use iText to read a PDF document containing an XFA form. I convert it to XML, read data from the XML and insert it in a database. But if I don't have an XFA form in the PDF, how can I efficiently read data from the PDF?
hrishi
  • 1,610
  • 6
  • 26
  • 43
0 votes, 1 answer

How is it possible to extract bookmarks from a PDF File in PHP using Smalot/PDFParser?

Right now I'm working with PHP and Laravel. My objective is to extract the most information possible out of an uploaded PDF file (using a Form and POST method) such as metadata (author, title, etc.), first page (cover), content of each page and the…
0 votes, 0 answers

How to parse and regenerate PDF using php

I want to edit some part of a PDF and regenerate it in the same format after editing. I have tried pdftk, but it won't allow editing the read-only labels. I succeeded in parsing the PDF using Smalot PDF parser, but now I don't know how to again…
prakash tank
  • 1,269
  • 1
  • 9
  • 15
0 votes, 0 answers

Coding a PDF Text Parser in swift

I'm currently developing a PDF text parser completely in Swift. I was looking through the PDFKitten code and found this in the stringwithpdfstring method (in SimpleFont.m), which takes a CGPDFStringRef as a parameter. const unsigned char *bytes =…
0 votes, 1 answer

'Smalot PDF Parser' result: text not on the same line

So I installed PDF Parser (http://www.pdfparser.org/). I checked their website and used the demo. This gave me the result I wanted. After hours of searching for how to use Composer, I finally managed to get it working. Now I’m stuck with the next…
PHPeter
  • 567
  • 6
  • 19