Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

Written entirely in Python. (for version 2.4 or newer)
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support. (well, almost)
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.
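
As a rough illustration of the layout information mentioned above, the sketch below lists each text block together with its bounding box. It assumes the pdfminer.six fork (which provides `pdfminer.high_level.extract_pages` and `pdfminer.layout.LTTextContainer`), not necessarily the original PDFMiner release described here; `iter_text_boxes` is a hypothetical helper name.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for every text block in a PDF.

    Assumption: pdfminer.six is installed. The imports live inside the
    function so this module can be loaded even when it is not.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry extractable text; skip figures, lines, etc.
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at the bottom-left.
                yield page_number, element.bbox, element.get_text().strip()
```

A caller might loop over `iter_text_boxes("example.pdf")` (a placeholder path) and print each tuple to see where text sits on the page.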

492 questions

2 answers

PDF Miner PDFEncryptionError

I'm trying to extract text from PDF files and later identify the references. I'm using pdfminer 20140328. With unencrypted files it runs well, but I now have a file where I get: File…

asked Dec 18 '15 at 14:19

RichieK

4 answers

pdfminer - ImportError: No module named pdfminer.pdfdocument

I am trying to install pdfMiner to work with CollectiveAccess. My host (pair.com) has given me the following information to help in this quest: When compiling, it will likely be necessary to instruct the installation to use your account space…

python pdfminer

asked Mar 09 '16 at 23:29

KLL

2 answers

How to use PDFminer.six with python 3?

I want to use pdfminer.six, a tool that can be used with Python 3 for extracting information from PDF documents. The problem is that there is no good documentation at all and no source code example showing how to use the tool. I have already tried…

python-3.x pdfminer

asked Jun 07 '19 at 12:10

Urvish

1 answer

What to do with CIDs in text extracted by PDFMiner?

I've some PDFs which are in Hindi, and have extractable text. I used pdfminer.six for python 3.6, to do the extraction. The output looks like: As one can see, there are a number of characters that are converted into the form "(cid :number)". On…

python pdf text pdfminer

asked Jun 09 '18 at 11:42

Mooncrater

1 answer

pdfminer.high_level not showing up

I am trying to convert a PDF to plain text using the pdfminer.high_level.extract_text(). I keep getting this error message: File "/Users/ian/Documents/Resume Selector Project/resumeBackend.py", line 5, in digestResume text =…

python python-3.x module pdfminer

asked Nov 21 '20 at 22:47

iamianbrown

1 answer

python pdfminer converts pdf file into one chunk of string with no spaces between words

I was using the following code mainly taken from DuckPuncher's answer to this post Extracting text from a PDF file using PDFMiner in python? to convert pdfs to text files: def convert_pdf_to_txt(path): rsrcmgr = PDFResourceManager() retstr =…

python-3.x pdfminer

asked Mar 23 '18 at 19:56

Yue Zhao

2 answers

Parsing Index page in a PDF text book with Python

I have to extract text from PDF pages as it is with the indentation into a CSV file. Index page from PDF text book: I should split the text into class and subclass type hierarchy along with the page numbers. For example in the image, Application…

python pdfminer pdftotext named-entity-recognition nlp

asked Mar 03 '18 at 18:35

Aryan

1 answer

Python PDFMiner : How to link outlines to underlying text

I am trying to parse a PDF and create some kind of a hierarchical structure. Consider the input Title 1 some text some text some text some text some text some text some text some text some text some text some text some text some text some text…

python parsing pdf pdfminer

asked Sep 14 '17 at 15:05

AbtPst

1 answer

decode CID font codes to equivalent ASCII characters

I'm trying to mine some text from a bunch of PDFs and a few of them have embedded CID fonts in the…

python fonts pdfminer

asked Jun 06 '14 at 19:24

dino

5 answers

How can I get the total count of total pages of a PDF file using PDFMiner in Python?

In pypdf, I can get the total number of pages of a PDF file via: from pypdf import PdfReader reader = PdfReader("example.pdf") no_of_pages = len(reader.pages) How can I get this using PDFMiner?

python pdfminer

asked Aug 23 '17 at 13:23

Malik Anas Ahmad
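
One way the page count asked about here could be obtained is by counting the page objects the parser yields. This is a minimal sketch assuming the pdfminer.six fork, where `PDFPage.get_pages` lives in `pdfminer.pdfpage`; `count_pages` is a hypothetical helper name, and the approach walks every page rather than reading a stored count.

```python
def count_pages(pdf_path):
    """Return the number of pages in a PDF by iterating over its pages.

    Assumption: pdfminer.six is installed. The import is kept inside the
    function so this module can be loaded even when it is not.
    """
    from pdfminer.pdfpage import PDFPage

    with open(pdf_path, "rb") as fp:
        # get_pages yields one PDFPage object per page; count them lazily.
        return sum(1 for _ in PDFPage.get_pages(fp))
```

Calling `count_pages("example.pdf")` on a placeholder file would return an integer comparable to `len(reader.pages)` in the pypdf snippet quoted in the question.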

6 answers

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error: pdf2txt.py 2.pdf Traceback (most recent call last): File "/usr/local/bin/pdf2txt.py", line 115, in if __name__ == '__main__':…

python pdf pdftotext pdfminer pdf-parsing

asked Oct 20 '16 at 15:28

Danil

1 answer

Finding word on page(s) in document

I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list. I first considered .docx format as an input and had a look at PythonDocx which has a search function,…

python python-docx pdfminer

asked Sep 05 '15 at 22:21

birgit

1 answer

Parsing a pdf(Devanagari script) using PDFminer gives incorrect output

I am trying to parse a PDF file containing an Indian voters list, which is in Hindi (Devanagari script). The PDF displays all the text correctly, but when I tried dumping this PDF into text format using PDFMiner, it output characters which are different…

python parsing pdf hindi pdfminer

asked Aug 07 '15 at 11:15

Rohit

1 answer

Replace (cid:) with chars using Python when extracting text from PDF files

I wrote code in Python that extracts text from PDF files. But for some files I'm getting some strange output. This is my code: import requests from io import BytesIO from pdfminer.high_level import extract_text, extract_pages pdf_link =…

python pdf encoding pdfminer

asked Mar 16 '21 at 13:20

taga
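
For the `(cid:N)` placeholders this question describes, a post-processing step could look roughly like the sketch below. It uses only the standard library and assumes the caller already has a mapping from CID numbers to characters; that mapping (`cid_to_char`) and the helper name `replace_cids` are hypothetical, and the sketch does not recover the mapping from the font itself.

```python
import re

# Matches placeholders such as "(cid:123)" left in extracted text.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def replace_cids(text, cid_to_char, default="\ufffd"):
    """Replace each "(cid:N)" token using a caller-supplied mapping.

    cid_to_char: dict mapping int CID -> replacement string (hypothetical;
    the correct values depend on the fonts embedded in the PDF).
    default: used for CIDs missing from the mapping (U+FFFD by default).
    """
    def _substitute(match):
        return cid_to_char.get(int(match.group(1)), default)

    return CID_PATTERN.sub(_substitute, text)
```

Unmapped CIDs become the Unicode replacement character by default, so gaps in the mapping stay visible in the output instead of being silently dropped.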

2 answers

Error: cannot import name 'PDFDocument' from 'pdfminer.pdfparser'

I need to extract text from pdf-files and have used pdfminer.six with success, extracting both text paragraphs and tables. But now I get an error related to the line from pdfminer.pdfparser import PDFParser, PDFDocument: ImportError: cannot…

python-3.x pdfminer

asked May 07 '19 at 13:24

Ingeborg
