Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because of the complexity of the PDF format (cf. the specification ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from every document) and are exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python-related options:

  • Extract tables directly with camelot ("PDF Table Extraction for Humans").
  • Process the PDF directly with tabula.
  • Convert the PDF to text with pdftotext, then parse the text with Python.
  • Use an external tool to convert the PDF to Excel or CSV, then open the resulting file with a suitable Python module.
  • Convert the PDF to an image file, then run recent OCR software on it (such tools can reconstruct tables from the picture) to get the data.
  • Combine pdf2image with pytesseract for the image-and-OCR route.
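
As a rough illustration of the pdftotext option above, here is a minimal sketch. It assumes the Poppler `pdftotext` command-line tool is installed and on the PATH, and that table columns in its `-layout` output are separated by at least two spaces; that assumption will not hold for every document. The helper names `pdf_to_text` and `split_columns` are hypothetical, chosen only for this example.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the external pdftotext tool and return the extracted text.

    -layout asks pdftotext to keep the physical layout of the page;
    the trailing "-" sends the text to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(text, min_gap=2):
    """Split every non-empty line into cells on runs of min_gap or more spaces.

    This is a heuristic: single spaces are kept inside a cell, wider gaps
    are treated as column boundaries.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    # splitlines() also breaks on the form feed that pdftotext emits
    # between pages, so page breaks do not end up inside a cell.
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows
```

A call such as `split_columns(pdf_to_text("report.pdf"))` (with `report.pdf` standing in for your own file) would then give a list of rows, each a list of cell strings, which can be passed on to a dataframe library. For tables with ragged or merged cells, a dedicated extractor such as camelot or tabula is likely the better starting point.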

Related Questions:

177 questions
0
votes
1 answer

How to avoid duplication in Python PDF parsing code for mismatching table structures?

I have over 100 PDFs that are match reports from which I want to scrape data in order to store it in dataframes so I can work with it afterwards. Problem is: Those PDFs don't always have the same structure and the reading from pdfplumber gives me…
0
votes
0 answers

Extracting data from pdf in specific format

I want to extract the data in the form of a hash. The text is like this. Sometimes there is a single arrest type with multiple charges and charge descriptions, and sometimes one arrest type with one charge and charge description. Sometimes multiple arrest…
snoozy
  • 25
  • 5
0
votes
0 answers

Clustering a set of letters into a table by position

I have a set of letters that are positioned on the plane (for each letter I know the coordinates of its corner points, and strings can be treated as parallelograms). I know that the strings form a table, but I know neither how many rows or columns…
JohnDiGriz
  • 171
  • 13
0
votes
0 answers

How is Word Able to detect PDF structure so well where others fail? Is there a Library that can achieve this?

I've been interested in parsing PDFs for some time now, with varying degrees of success. Often, however, the useful data in PDFs is contained in the text, i.e. outside tables etc. If you are to get data out of the sentences, however, it is vital that the…
Nick
  • 789
  • 5
  • 22
0
votes
0 answers

Error from tabula-java: Error: Error: Header doesn't contain versioninfo

I have a script that parses PDF files. On my WSL it works perfectly, but when I deploy it on CentOS 7, I get this error. I'm using tabula-py; Python version: 3.6, Java version: 11. When I searched for the error, I found nothing. Can someone…
0
votes
1 answer

How to calculate coordinates of the PDF text (knowing only the list of operations)

I'm processing a PDF document in a program. The only part of the document I have access to is a list of PDF operations (with their arguments), and a list of horizontal displacements for the glyphs and fonts that appear in the document. Is it…
JohnDiGriz
  • 171
  • 13
0
votes
0 answers

How to identify whether the text is boxed in PDF using PDFBOX?

I am trying to check whether the text is boxed using Apache PDFBox. For a few PDFs the code below won't work. public class PDFBoxReader extends PDFGraphicsStreamEngine { private static ArrayList recList = new ArrayList(); …
Arunkumar
  • 3
  • 2
0
votes
1 answer

How do I reference the PDF IFilter (dll) interface built into Windows to extract text and properties of a pdf document via Classic ASP

I want to extract text and properties (author, title, etc.) of a PDF file. I need to extract and parse text from a PDF file in a classic ASP environment. I read another post about using the PDF iFilter driver installed with Adobe Acrobat 9 which can…
Sanjeev
0
votes
1 answer

how to recognize a graph in pdf using python?

I'm new to PDF parsing. I want to recognize a graph in a PDF file so that I can skip it and not extract this type of text. All I know about the PDF is that it was generated from Word (not scanned). Input: a PDF with a graph such as this one. Output should…
0
votes
1 answer

How to extract text based on parts from a PDF file in JSON format?

From this file https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/Dart.pdf I would like to get this kind of result: { "file": { "title": "Dart Programming Language Specification", "1 Scope": { …
Tim
  • 513
  • 5
  • 20
0
votes
1 answer

Php Pdf Parser read content showing as a two lines. need to fix it

I used pdfparser to read PDF content, but one address line shows as two lines; in that case it appears as two new lines. I want to get the full address as one line. The PDF files are dynamic; depending on the address length, it is showing as a…
0
votes
1 answer

how to upload local pdf files to google collab notebook?

I want to upload a local PDF to Google Colab and parse it with Python. How can I load the file so I can use with open?
0
votes
0 answers

Extract geometric objects (lines, circles,...) from a pdf using PDFMM

I have a PDF containing several geometric objects (mostly lines) in different sizes and colors. I want to extract them in the following form, e.g. for lines: (startx, starty) (endx, endy) width color. Optionally, a "z" position determining which object…
Peter
  • 1
0
votes
1 answer

Apache PDFBox - vertical match between image and text position

I need help to achieve a mapping between text and image objects in a PDF document. As the first figure shows, my PDF documents have 3 images arranged randomly in the y-direction. To the left of them are texts. The texts extend along the height of…
ralle
  • 15
  • 5
0
votes
1 answer

(while reading XRef): Error: Invalid XRef stream header?

Hi, I am trying to read a PDF in Node.js. When I try to read this PDF, it starts showing this error: (while reading XRef): Error: Invalid XRef stream header Error: Error: Invalid XRef stream header at error (eval at