Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
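
As a rough illustration of obtaining text locations, here is a minimal sketch. It assumes the pdfminer.six fork (the one several questions below refer to) and its `pdfminer.high_level.extract_pages` function. The names `TextBox`, `collect_text_boxes`, and `text_boxes_from_pdf` are hypothetical helpers made up for this example and are not part of the library.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class TextBox(NamedTuple):
    """Hypothetical record type for this sketch (not a pdfminer class)."""
    page: int                                # 1-based page number
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    text: str


def collect_text_boxes(pages: Iterable) -> Iterator[TextBox]:
    """Yield one TextBox per text-bearing element on each page layout.

    Assumes each page is iterable and that text elements expose `bbox`
    and `get_text()`, as pdfminer.six text containers do.
    """
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            # Elements without get_text() (figures, lines, rectangles) are skipped.
            if hasattr(element, "get_text"):
                yield TextBox(page_number, tuple(element.bbox),
                              element.get_text().strip())


def text_boxes_from_pdf(path: str) -> Iterator[TextBox]:
    """Run pdfminer.six layout analysis on `path` and yield its text boxes."""
    # Imported lazily so collect_text_boxes() works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    return collect_text_boxes(extract_pages(path))
```

Iterating over `text_boxes_from_pdf("example.pdf")`, where `example.pdf` is a placeholder path, would yield one record per text block with its page number and bounding box.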

Features

  • Written entirely in Python (for version 2.4 or newer).
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.

(source)

492 questions
12 votes
2 answers

PDF Miner PDFEncryptionError

I'm trying to extract text from PDF files and then identify the references. I'm using pdfminer 20140328. With unencrypted files it runs well, but I now have a file where I get: File…
RichieK
  • 474
  • 6
  • 15
11 votes
4 answers

pdfminer - ImportError: No module named pdfminer.pdfdocument

I am trying to install pdfMiner to work with CollectiveAccess. My host (pair.com) has given me the following information to help in this quest: When compiling, it will likely be necessary to instruct the installation to use your account space…
KLL
  • 113
  • 1
  • 1
  • 5
10 votes
2 answers

How to use PDFminer.six with python 3?

I want to use pdfminer.six, a tool that can be used with Python 3 for extracting information from PDF documents. The problem is that there is no good documentation and no source code example showing how to use the tool. I have already tried…
Urvish
  • 643
  • 3
  • 10
  • 19
10 votes
1 answer

What to do with CIDs in text extracted by PDFMiner?

I have some PDFs that are in Hindi and have extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like: As one can see, a number of characters are converted into the form "(cid :number)". On…
Mooncrater
  • 4,146
  • 4
  • 33
  • 62
9 votes
1 answer

pdfminer.high_level not showing up

I am trying to convert a PDF to plain text using pdfminer.high_level.extract_text(). I keep getting this error message: File "/Users/ian/Documents/Resume Selector Project/resumeBackend.py", line 5, in digestResume text =…
iamianbrown
  • 91
  • 1
  • 1
  • 3
8 votes
1 answer

python pdfminer converts pdf file into one chunk of string with no spaces between words

I was using the following code mainly taken from DuckPuncher's answer to this post Extracting text from a PDF file using PDFMiner in python? to convert pdfs to text files: def convert_pdf_to_txt(path): rsrcmgr = PDFResourceManager() retstr =…
Yue Zhao
  • 154
  • 1
  • 3
  • 9
8 votes
2 answers

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as is, indentation included, into a CSV file. Index page from PDF text book: I should split the text into a class and subclass hierarchy along with the page numbers. For example in the image, Application…
Aryan
  • 81
  • 1
  • 5
8 votes
1 answer

Python PDFMiner : How to link outlines to underlying text

I am trying to parse a PDF and create some kind of a hierarchical structure. Consider the input Title 1 some text some text some text some text some text some text some text some text some text some text some text some text some text some text…
AbtPst
  • 7,778
  • 17
  • 91
  • 172
8 votes
1 answer

decode CID font codes to equivalent ASCII characters

I'm trying to mine some text from a bunch of PDFs and a few of them have embedded CID fonts in the…
dino
  • 3,093
  • 4
  • 31
  • 50
7 votes
5 answers

How can I get the total count of total pages of a PDF file using PDFMiner in Python?

In pypdf, I can get the total number of pages of a PDF file via: from pypdf import PdfReader reader = PdfReader("example.pdf") no_of_pages = len(reader.pages) How can I get this using PDFMiner?
Malik Anas Ahmad
  • 103
  • 1
  • 1
  • 6
7 votes
6 answers

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error: pdf2txt.py 2.pdf Traceback (most recent call last): File "/usr/local/bin/pdf2txt.py", line 115, in if __name__ == '__main__':…
Danil
  • 4,781
  • 1
  • 35
  • 50
7 votes
1 answer

Finding word on page(s) in document

I am looking for an elegant way to find the page(s) of a document on which a certain word, stored in a Python dictionary/list, occurs. I first considered .docx format as an input and had a look at PythonDocx which has a search function,…
birgit
  • 1,121
  • 2
  • 21
  • 39
7 votes
1 answer

Parsing a pdf(Devanagari script) using PDFminer gives incorrect output

I am trying to parse a PDF file containing an Indian voters list in Hindi (Devanagari script). The PDF displays all the text correctly, but when I tried dumping it to text format using PDFMiner, the output characters are different…
Rohit
  • 179
  • 3
  • 14
6 votes
1 answer

Replace (cid:) with chars using Python when extracting text from PDF files

I wrote code in Python that extracts text from PDF files, but for some files I'm getting strange output. This is my code: import requests from io import BytesIO from pdfminer.high_level import extract_text, extract_pages pdf_link =…
taga
  • 3,537
  • 13
  • 53
  • 119
6 votes
2 answers

Error: cannot import name 'PDFDocument' from 'pdfminer.pdfparser'

I need to extract text from PDF files and have used pdfminer.six with success, extracting both text paragraphs and tables. But now I get an error related to the line from pdfminer.pdfparser import PDFParser, PDFDocument: ImportError: cannot…
Ingeborg
  • 369
  • 1
  • 5
  • 17