Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

Written entirely in Python. (for version 2.4 or newer)
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support. (well, almost)
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than C/C++-based counterparts such as Xpdf.
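
The description above says PDFMiner can report the exact location of text on a page. A minimal sketch of that idea follows, assuming the maintained pdfminer.six fork (not the original Python 2 release), whose `pdfminer.high_level.extract_pages` yields one layout object per page. The names `TextBox`, `iter_text_boxes`, and `print_text_boxes` are hypothetical helpers invented for this illustration, and text elements are detected by duck typing rather than by a pdfminer class check.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class TextBox(NamedTuple):
    """Hypothetical record: one text element and where it sits on its page."""
    page_number: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at bottom-left
    text: str


def iter_text_boxes(pages: Iterable) -> Iterator[TextBox]:
    """Yield a TextBox for every element that carries both text and a bbox.

    `pages` is any iterable of page layouts, where each page is itself
    iterable over its elements (as pdfminer.six layout pages are).
    """
    for page_number, page in enumerate(pages, start=1):
        for element in page:
            # Duck typing (an assumption of this sketch): lines, rectangles
            # and images have a bbox but no get_text(), so they are skipped.
            if hasattr(element, "get_text") and hasattr(element, "bbox"):
                yield TextBox(page_number, tuple(element.bbox),
                              element.get_text().strip())


def print_text_boxes(pdf_path: str) -> None:
    """Print every text box of a PDF file; requires pdfminer.six."""
    # Imported lazily so the helper above stays usable without the library.
    from pdfminer.high_level import extract_pages

    for box in iter_text_boxes(extract_pages(pdf_path)):
        print(box.page_number, box.bbox, repr(box.text))
```

Calling `print_text_boxes("samples/simple1.pdf")` would list each text block with its bounding box, provided pdfminer.six is installed and the file exists.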

492 questions

1 answer

Values from a class instance's attribute being added to a different instance of the same class

I'm parsing pdfs to extract table data using my PdfTable class. When I create a class instance then create another class instance it seems that the first class instance file_1.cells are being prepended to the second class instance file_2.cells. I…

python pdfminer

asked Nov 12 '16 at 22:42

Mox
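
The excerpt above is truncated, so the actual cause in the asker's `PdfTable` class is not known. One common cause of this symptom in Python is a mutable attribute defined on the class rather than in `__init__`, which makes every instance share the same list. The sketch below illustrates that assumption only; `SharedTable` and `IsolatedTable` are hypothetical classes, not the asker's code.

```python
class SharedTable:
    # Class attribute: a single list object shared by every instance.
    cells = []

    def add_cell(self, cell):
        self.cells.append(cell)  # mutates the shared list


class IsolatedTable:
    def __init__(self):
        # Instance attribute: each object gets its own fresh list.
        self.cells = []

    def add_cell(self, cell):
        self.cells.append(cell)


# With the class-level list, the second instance sees the first one's cells.
first_shared, second_shared = SharedTable(), SharedTable()
first_shared.add_cell("a")
second_shared.add_cell("b")

# With per-instance lists, the two objects stay independent.
first_isolated, second_isolated = IsolatedTable(), IsolatedTable()
first_isolated.add_cell("a")
second_isolated.add_cell("b")
```

A mutable default argument such as `def __init__(self, cells=[])` produces the same kind of sharing and has the same fix: create the list inside the method body.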

1 answer

transform pdfminer bbox coordinates to iOS screen

I am doing an iPad application project in Swift where I need to extract PDF word bbox coordinates and transform them to iPad screen coordinates. The goal is to be able to detect when a word is being touched. I am using a webview to display the…

ios swift pdfminer

asked Sep 23 '16 at 01:58

Nilo0f4r
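
For the coordinate question above, the core of the transform is an axis flip: PDF coordinates place the origin at the bottom-left of the page with y growing upward, while screen coordinates usually start at the top-left with y growing downward. The sketch below is written in Python rather than Swift and rests on stated assumptions: the page is not rotated, its lower-left corner is at (0, 0), and the view applies one uniform scale with no scroll offset. `pdf_bbox_to_screen` is a hypothetical helper name.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def pdf_bbox_to_screen(bbox: BBox, page_height: float, scale: float = 1.0) -> BBox:
    """Convert a PDF-space bbox (x0, y0, x1, y1) to (left, top, right, bottom).

    The result uses a top-left origin, so the PDF box's upper edge (y1)
    becomes the smaller screen y value.
    """
    x0, y0, x1, y1 = bbox
    left = x0 * scale
    right = x1 * scale
    top = (page_height - y1) * scale      # flip the y axis
    bottom = (page_height - y0) * scale
    return (left, top, right, bottom)


def contains_point(screen_box: BBox, x: float, y: float) -> bool:
    """Hit test: is the touched point inside the converted box?"""
    left, top, right, bottom = screen_box
    return left <= x <= right and top <= y <= bottom
```

For a US Letter page (792 points high), a word at `(100, 700, 200, 720)` maps to `(100, 72, 200, 92)` at scale 1, and a touch at `(150, 80)` falls inside it.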

1 answer

Workaround for the former CStringIO and StringIO functions in Python 3 pdfinterp (pdfminer)

I am using the pdfminer tool to convert pdf to .csv (text) and one of the subcommands in the tool pdfinterp.py still uses the CStringIO and StringIO for string to string translation - import re try: from CStringIO import StringIO except…

python c-strings pdfminer

asked Sep 13 '16 at 17:59

Jose Rodriguez
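
The import pattern quoted in the excerpt above is usually completed with a fallback to the `io` module, since the Python 2 modules `cStringIO` and `StringIO` were removed in Python 3. The following is a minimal sketch of that compatibility import, not the actual contents of `pdfinterp.py`; the variable names are illustrative.

```python
try:
    # Python 2 only: the C-accelerated in-memory text buffer.
    from cStringIO import StringIO
except ImportError:
    # Python 3: io.StringIO holds str; use io.BytesIO for raw bytes.
    from io import StringIO

# Collect text in memory instead of writing to a file or sys.stdout.
buffer = StringIO()
buffer.write("extracted text")
contents = buffer.getvalue()
buffer.close()
```

On Python 3 the choice between `io.StringIO` and `io.BytesIO` matters: code that writes undecoded PDF stream data needs the bytes variant, while code that writes decoded text needs the string variant.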

1 answer

Write html tags to text file in python

I've used pdfminer to convert complex (tables, figures) and very long pdfs to html. I want to parse the results further (e.g. extract tables, paragraphs etc) and then use sentence tokenizer from nltk to do further analysis. For this purposes I want…

python io pdftotext pdfminer pdf-to-html

asked Jul 22 '16 at 13:49

In777

2 answers

Separate pdf to pages using pdfminer

I am trying to extract a pdf page by page and store the results in a dictionary as follows: from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from…

python pdfminer

asked Jul 03 '16 at 17:34

Echchama Nayak

1 answer

PDF to TEXT converted in a wrong way

I am extracting the text from many PDF files using pdfminer. The resulting text file for some PDF files is strange: each line consists of one character only. This happens for some of the PDF files, not all, and I still can't find out why and which PDF…

python pdfminer

asked Jun 23 '16 at 13:23

The Maestro

0 answers

Can't convert pdf to text even though trying pdfminer, pdf2txt, textract in Python

I'm having trouble extracting text from pdf files which were originally converted from InDesign and Illustrator. I'm working on a project that needs data from these pdf files. I have tried pdfminer, pdf2txt libs in Python, but none of them works…

python text adobe-indesign pdf-conversion pdfminer

asked Jun 21 '16 at 18:09

Nhi Tran

1 answer

Losing information when extracting text from PDF using PDFMiner

I'm using Python 3.4 on Windows 7 and hoping I can extract text from PDF files using PDFMiner. However, losing information was quite common when I was testing. For some files, it may be just a matter of a few sentences. But I've encountered…

python python-3.x pdf poppler pdfminer

asked Jun 16 '16 at 02:27

joe wong

1 answer

Automate desktop screening with Python

I am trying to make a program that could automatically scan the images or texts on a user's desktop and then convert it to a .txt file for text analysis. So far I have found source codes to convert PDF and HTML into .txt. However I would like to…

python-2.7 pdfminer

asked Apr 10 '16 at 05:12

Kirsteen Ng

1 answer

Input coordinates in pdfminer and get results

I am trying to extract text in pdfminer by inputting coordinates. I have searched the internet but could not find any documentation or code relating to that. So far I have found code that extracts text and outputs its coordinates.…

python pdfminer

asked Feb 23 '16 at 09:35

Raja Ramachandran
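
One way to approach the question above is to reverse the usual flow: extract every text box with its coordinates first, then keep only the boxes that fall inside the rectangle of interest. The sketch below shows only the filtering step and assumes the boxes are already available as `(bbox, text)` pairs in PDF coordinates. `texts_in_region` is a hypothetical helper, not a pdfminer function.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def texts_in_region(boxes: Iterable[Tuple[BBox, str]], region: BBox) -> List[str]:
    """Return the text of every box that lies entirely inside `region`."""
    rx0, ry0, rx1, ry1 = region
    return [
        text
        for (x0, y0, x1, y1), text in boxes
        # Full containment; loosen to an overlap test if partial hits count.
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
    ]


# Illustrative data only: two boxes on a page, one inside the query region.
sample_boxes = [
    ((72.0, 700.0, 200.0, 712.0), "Invoice number"),
    ((72.0, 100.0, 200.0, 112.0), "Page footer"),
]
header_texts = texts_in_region(sample_boxes, (0.0, 650.0, 612.0, 792.0))
```

Here `header_texts` contains only `"Invoice number"`, because the footer box lies below the queried rectangle.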

1 answer

Passing argument to pdf2txt function

I'm trying to use PDFMiner to extract texts from PDF file. I wanted to use script pdf2txt.py to run the sample example in http://www.unixuser.org/~euske/python/pdfminer/index.html with this single line pdf2txt.py samples/simple1.pdf Since I'm…

python python-2.7 command-line-arguments python-idle pdfminer

asked Oct 24 '15 at 03:03

Jason

1 answer

What does the Preview app of OS X do to help extracting from pdf?

When I extracted content from a pdf file with 12 pages using my program based on pdfminer, I got a wrong result with only 11 pages. I tested it with other files and got the right result in most cases. By accident, I opened it with the Preview app in OS X…

python pdf pdfminer

asked Aug 25 '15 at 09:49

soulcoder

1 answer

Pull specific data from PDF file using text indices to locate

I’m parsing PDF files that show info for multiple different shipments of items. Data includes addresses, commodity amount, etc. I have successfully pulled the string of text that constitutes substance of each file. Files are relatively consistent in…

python regex pdf pdfminer

asked Aug 08 '15 at 04:40

Murcielago

1 answer

What should I use as piece of code to run PDFMiner 3k?

I want to use PDFMiner 3k. I'm using Python 3.3.3 on Windows and I don't know what instructions to write to use PDFMiner 3k. I've tried many code samples and they still don't work; some of them were for PDFMiner (Python 2.7). For example, I've tried the…

python hash nlp pypi pdfminer

asked Apr 20 '15 at 09:26

ziMtyth

1 answer

How to write an extracted image to a file object instead of to the file system?

I'm using the Python pdfminer library to extract both text and images from a PDF. Since the TextConverter class by default writes to sys.stdout, I used StringIO to catch the text as a variable as follows (see paste: def…

python pdf io stream pdfminer

asked Dec 15 '14 at 09:38

kramer65
