Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because of the complexity of the PDF format (cf. the specification ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from every document) and can be exposed to security risks when processing untrusted files.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.
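To illustrate what "raw data" means here, the following is a minimal sketch (not how pdf-parser itself works) that finds stream objects in a PDF's bytes and tries to inflate them with zlib, the compression behind the common /FlateDecode filter. The names STREAM_RE, inflate_streams, and SAMPLE are made up for this example. It assumes an unencrypted file, ignores the /Length entry, filter chains, and object streams, and a real parser has to handle all of those.

```python
import re
import zlib

# Naive pattern: the "stream" keyword (not the tail of "endstream"),
# an end-of-line, the data, an end-of-line, then "endstream".
# A real parser would use the /Length entry instead of searching.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return one entry per stream: the inflated bytes, or None if the
    stream is not plain zlib data (other filter, encrypted, uncompressed)."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            results.append(zlib.decompress(raw))
        except zlib.error:
            results.append(None)
    return results


# Hand-built fragment for illustration only; it is not a complete, valid PDF.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(b"BT (Hello) Tj ET")
    + b"\nendstream\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nplain\nendstream\nendobj\n"
)

if __name__ == "__main__":
    for index, data in enumerate(inflate_streams(SAMPLE)):
        print(index, data)
```

Streams that fail to inflate are reported as None rather than raising, since a single PDF usually mixes compressed content streams with image data that uses other filters.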

Python Related Options:

  • You may extract tables directly using camelot ("PDF Table Extraction for Humans").
  • You may process the PDF directly using tabula.
  • You may convert the PDF to text using pdftotext, then parse the text with Python.
  • You may use an external tool to convert the PDF file to Excel or CSV, then open the resulting file with the appropriate Python module.
  • You may also convert the PDF to an image file, then use recent OCR software (which can reconstruct tables from the picture) to get the data.
  • You may combine pdf2image with pytesseract for the OCR route.
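As a rough sketch of the pdftotext route listed above: the pdftotext command-line tool is assumed to be installed and on PATH, and its -layout option keeps columns visually aligned with spaces. The helper names pdf_to_text and split_columns, and the sample text, are hypothetical. Splitting on runs of two or more spaces is only a heuristic and will break on tables whose cells contain wide gaps or are empty.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run `pdftotext -layout <pdf> -` and return the text from stdout.
    Assumes the pdftotext CLI is installed; raises if it exits non-zero."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_columns(text):
    """Split each non-empty line into cells on runs of 2+ whitespace chars."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


# Made-up layout-style text standing in for real pdftotext output.
SAMPLE_TEXT = (
    "Item        Qty   Price\n"
    "Widget A      2   3,85 EUR\n"
    "\n"
    "Gadget       10   38,50 EUR\n"
)

if __name__ == "__main__":
    for row in split_columns(SAMPLE_TEXT):
        print(row)
```

With a real file, the two steps would be chained as split_columns(pdf_to_text("input.pdf")), where "input.pdf" is a placeholder path.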

Related Questions:

177 questions
7
votes
6 answers

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error: pdf2txt.py 2.pdf Traceback (most recent call last): File "/usr/local/bin/pdf2txt.py", line 115, in if __name__ == '__main__':…
Danil
  • 4,781
  • 1
  • 35
  • 50
7
votes
1 answer

Difference between iTextSharp 4.1.6 and 5.x versions

We are developing a Pdf parser to be used along with our system. The requirement is such that, we store all the information on any pdf documents and should be able to reproduce the document as such (with minimal changes from original document). We…
Shankar
  • 327
  • 2
  • 6
  • 16
6
votes
0 answers

Regarding No Unicode mapping error while parsing pdf

I have a bunch of PDF files (from different sources) and I'd like to extract text from them (unfortunately I can't attach the files). Current parsing outcome: Tika silently returns text, which is missing a lot of needed data. Using PDFBox directly…
exenza
  • 966
  • 10
  • 21
6
votes
2 answers

PDFminer empty output

While processing a file with pdfminer (pdf2txt.py) I received empty output: dan@work:~/project$ pdf2txt.py docs/homericaeast.pdf dan@work:~/project$ Can anybody say what is wrong with this file and what I can do to get data from it? Here's…
Danil
  • 4,781
  • 1
  • 35
  • 50
6
votes
2 answers

Apache PDFBox Remove Spaces between characters

We are using PDFBox to extract text from PDFs. Some PDFs' text can't be extracted correctly. The following image shows a part of the PDF as an image: After text extraction we get the following text: 3, 8 5 EU R 1 Netto 38,50 EUR…
TobiasH
  • 83
  • 1
  • 8
6
votes
1 answer

get text paragraph from pdf using itextsharp

Is there any logic to get paragraph text from a PDF file using iTextSharp? I know PDF only supports runs of text and it's hard to determine which runs of text are related to which paragraph, and I also know that there aren't any tags or other tags to…
Bibek Gautam
  • 581
  • 8
  • 30
6
votes
1 answer

haskell - parsing/reading content of .pdf-files

is there any possibility in haskell to just decrypt a .pdf file, read in the content and return a String? And, if there is one, could you give me a little example like e.g.: ... import necessaryPackage ... pdfParsing = ... ... Thanks in…
jimmyt
  • 491
  • 4
  • 10
5
votes
2 answers

Parsing PDF files in Hadoop Map Reduce

I have to parse PDF files that are in HDFS in a MapReduce program in Hadoop. So I get the PDF file from HDFS as input splits and it has to be parsed and sent to the Mapper class. For implementing this InputFormat I had gone through this link.…
WR10
  • 443
  • 1
  • 4
  • 16
5
votes
1 answer

Identify and extract specific sections of a PDF document

I have several exams in PDF format. I want to programmatically extract each question as a separate image/document. OCR is not ideal because it does not maintain code/equation formatting well. The end goal is to make flash cards with each card…
aki
  • 164
  • 1
  • 1
  • 12
5
votes
2 answers

PDF Cross Reference Streams

I'm developing a PDF parser/writer, but I'm stuck at generating cross reference streams. My program reads this file and then removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it. This…
Van Coding
  • 24,244
  • 24
  • 88
  • 132
5
votes
2 answers

Extracting tables from a pdf

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck but I can't really get the data from the tables. This is what one of the tables looks like: As you can see, some columns are marked with an…
user
  • 715
  • 4
  • 13
  • 32
5
votes
1 answer

Python PDFMiner error: "No /Root object! - Is this really a PDF?"

I am getting this error "No /Root object! - Is this really a PDF?" on my Mac with Python 2.7 and PDFMiner version 20110515. The PDF files are not damaged, because the same program with the same files works on my PC! Also I have…
Mahshid Zeinaly
  • 3,590
  • 6
  • 25
  • 32
4
votes
2 answers

Extract all text with string positions from a PDF

This may seem an old question, but I didn't find an exhaustive answer after spending half an hour searching all over SO. I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am…
Andrea Sprega
  • 2,221
  • 2
  • 29
  • 35
4
votes
3 answers

AttributeError: 'bytes' object has no attribute 'close' when Tika parser is run

I'm trying to run a simple line of code using Tika to parse text from a PDF (named outputFileName in this example). This used to run without errors. I recently had my laptop sent in to our work IT for software updates and had to reinstall…
dweir247
  • 63
  • 4
4
votes
1 answer

node.js How to use a url as pdf-path to work with pdf2json

I'm using node.js and pdf2json parser to parse a pdf file. Currently it is working with a local pdf file. But I'm trying to get a pdf-file through the URL/HTTP Module of node.js and I want to open this file to parse it. Is there any possibility to…
Daniel Wahl
  • 59
  • 1
  • 7