Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

  • Written entirely in Python. (for version 2.4 or newer)
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support. (well, almost)
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.

492 questions
0 votes, 1 answer

Values from a class instance's attribute being added to a different instance of the same class

I'm parsing PDFs to extract table data using my PdfTable class. When I create one instance and then create another, the first instance's file_1.cells seem to be prepended to the second instance's file_2.cells. I…
Mox
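
The excerpt is cut off before the class definition, so the actual cause is not known. One common cause of this symptom in Python is a list defined at class level (or as a mutable default argument), which every instance then shares. A minimal sketch, assuming hypothetical stand-in classes with a `cells` list; the asker's real `PdfTable` may differ:

```python
class SharedPdfTable:
    # Hypothetical stand-in: `cells` is created once, on the class itself,
    # so every instance appends to the same list.
    cells = []

    def add_cell(self, value):
        self.cells.append(value)


class FixedPdfTable:
    def __init__(self):
        # Each instance gets its own fresh list.
        self.cells = []

    def add_cell(self, value):
        self.cells.append(value)


shared_1, shared_2 = SharedPdfTable(), SharedPdfTable()
shared_1.add_cell("a")
shared_2.add_cell("b")

fixed_1, fixed_2 = FixedPdfTable(), FixedPdfTable()
fixed_1.add_cell("a")
fixed_2.add_cell("b")
```

With the class-level list, `shared_2.cells` also contains the value added through `shared_1`; with the list created in `__init__`, each instance keeps only its own cells.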
0 votes, 1 answer

transform pdfminer bbox coordinates to iOS screen

I am working on an iPad application project in Swift where I need to extract PDF word bbox coordinates and transform them to iPad screen coordinates. The goal is to be able to detect when a word is being touched. I am using a webview to display the…
Nilo0f4r
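
The core of such a transform is a vertical flip plus a scale: PDF user space has its origin at the bottom-left of the page, while screen coordinates usually start at the top-left. A rough sketch in Python (the question itself targets Swift), where `page_height`, `scale`, and `offset` are hypothetical parameters; it ignores page rotation, crop-box offsets, and any zooming or scrolling done by the web view:

```python
def pdf_bbox_to_screen(bbox, page_height, scale=1.0, offset=(0.0, 0.0)):
    """Map a PDF bbox (x0, y0, x1, y1; origin bottom-left) to a screen
    rectangle (x, y, width, height; origin top-left)."""
    x0, y0, x1, y1 = bbox
    off_x, off_y = offset
    screen_x = off_x + x0 * scale
    # The top edge of the box is y1 in PDF space; flip it against the page height.
    screen_y = off_y + (page_height - y1) * scale
    return (screen_x, screen_y, (x1 - x0) * scale, (y1 - y0) * scale)


# A word box on a US Letter page (792 points tall), shown at 2x scale.
rect = pdf_bbox_to_screen((72.0, 700.0, 144.0, 712.0), page_height=792.0, scale=2.0)
```

A touch point can then be hit-tested against the resulting rectangle.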
0 votes, 1 answer

work around for former CStringIO and String IO function in Python 3 Pdfinterp (Pdfminer)

I am using the pdfminer tool to convert pdf to .csv (text) and one of the subcommands in the tool pdfinterp.py still uses the CStringIO and StringIO for string to string translation - import re try: from CStringIO import StringIO except…
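
A common compatibility pattern for this situation is to try the Python 2 import first and fall back to `io.StringIO` on Python 3. Note that the Python 2 module is spelled `cStringIO`, with a lowercase `c`. This is a generic sketch, not the actual contents of pdfinterp.py:

```python
try:
    # Python 2: the C-accelerated module is spelled cStringIO (lowercase c).
    from cStringIO import StringIO
except ImportError:
    # Python 3: both StringIO modules were replaced by io.StringIO.
    from io import StringIO

buffer = StringIO()
buffer.write(u"extracted text")
contents = buffer.getvalue()
```

If the code handles raw bytes rather than text, `io.BytesIO` is the Python 3 counterpart to use instead.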
0 votes, 1 answer

Write html tags to text file in python

I've used pdfminer to convert complex (tables, figures) and very long PDFs to HTML. I want to parse the results further (e.g. extract tables, paragraphs etc.) and then use the sentence tokenizer from nltk to do further analysis. For this purpose I want…
In777
0 votes, 2 answers

Separate pdf to pages using pdfminer

I am trying to extract a pdf page by page and store the results in a dictionary as follows: from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from…
Echchama Nayak
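
One lightweight way to get a per-page dictionary, assuming the text was produced by pdfminer's `TextConverter` (which, to the best of my knowledge, writes a form feed character after each page), is to split the extracted string on `"\f"`. The helper name `split_pages` is hypothetical:

```python
def split_pages(text):
    """Return {page_number: page_text}, splitting on form feed characters."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return {number: page for number, page in enumerate(pages, start=1)}


sample = "first page\n\fsecond page\n\f"
pages = split_pages(sample)
```

If the text comes from a different converter, check first whether it actually contains form feeds; otherwise the whole document ends up under page 1.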
0 votes, 1 answer

PDF to TEXT converted in a wrong way

I am extracting the text from many PDF files using pdfminer. For some PDF files the resulting text file is strange: each line consists of only one character. This affects only some of the PDF files, and I still can't find out why and which PDF…
The Maestro
0 votes, 0 answers

Can't convert pdf to text even though trying pdfminer, pdf2txt, textract in Python

I'm having trouble extracting text from PDF files which were originally converted from InDesign and Illustrator. I'm working on a project that needs data from these PDF files. I have tried the pdfminer and pdf2txt libs in Python, but none of them works…
Nhi Tran
0 votes, 1 answer

Losing information when extracting text from PDF using PDFMiner

I'm using Python 3.4 on Windows 7 and hoping I can extract text from PDF files using PDFMiner. However, losing information was quite common when I was testing. For some files, it may be just a matter of a few sentences. But I've encountered…
joe wong
0 votes, 1 answer

Automate desktop screening with Python

I am trying to make a program that could automatically scan the images or texts on a user's desktop and then convert it to a .txt file for text analysis. So far I have found source codes to convert PDF and HTML into .txt. However I would like to…
Kirsteen Ng
0 votes, 1 answer

Input coordinates in pdfminer and get results

I am trying to extract text in pdfminer by inputting co-ordinates. I have searched the internet but could not find any documentation or code relating to that. So far I have found code that extracts text and outputs its co-ordinates.…
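
Since code that outputs text together with its co-ordinates is already available, one possible approach is to invert the lookup: collect every (text, bbox) pair and keep those that fall inside the region of interest. A sketch using plain tuples as stand-ins for pdfminer layout objects; the function name and sample data are hypothetical:

```python
def text_in_region(items, region):
    """Return texts whose bbox lies entirely inside `region`.

    `items` is a list of (text, (x0, y0, x1, y1)) pairs; `region` uses the
    same (x0, y0, x1, y1) layout.
    """
    rx0, ry0, rx1, ry1 = region
    return [
        text
        for text, (x0, y0, x1, y1) in items
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
    ]


items = [
    ("Invoice", (50, 700, 120, 715)),
    ("Total", (400, 100, 450, 115)),
]
found = text_in_region(items, (0, 650, 300, 800))
```

This keeps only boxes fully contained in the region; an overlap test would be the alternative if partially covered text should also count.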
0 votes, 1 answer

Passing argument to pdf2txt function

I'm trying to use PDFMiner to extract texts from PDF file. I wanted to use script pdf2txt.py to run the sample example in http://www.unixuser.org/~euske/python/pdfminer/index.html with this single line pdf2txt.py samples/simple1.pdf Since I'm…
Jason
0 votes, 1 answer

What does preview app of OS X do to help extracting from pdf?

When I extracted content from a PDF file with 12 pages using my program based on pdfminer, I got a wrong result with only 11 pages. I tested it with other files and got the right result in most cases. By accident, I opened it with the Preview app in OS X…
soulcoder
0 votes, 1 answer

Pull specific data from PDF file using text indices to locate

I’m parsing PDF files that show info for multiple different shipments of items. Data includes addresses, commodity amount, etc. I have successfully pulled the string of text that constitutes substance of each file. Files are relatively consistent in…
Murcielago
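
When the files are fairly consistent, locating fields by the index of a known label is often enough. A small sketch using `str.find`; the labels and sample string below are invented for illustration and do not come from the asker's files:

```python
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


sample = "Ship To: 12 Harbor Rd Commodity: Steel coils Weight: 900"
address = between(sample, "Ship To:", "Commodity:")
commodity = between(sample, "Commodity:", "Weight:")
```

Returning `None` for a missing marker makes it easy to flag the files that deviate from the usual layout.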
0 votes, 1 answer

What should I use as piece of code to run PDFMiner 3k?

I want to use PDFMiner 3k with Python 3.3.3 on Windows, but I don't know what instructions to write to use PDFMiner 3k. I've tried many code samples and they still don't work; some of them were for PDFMiner (Python 2.7). For example, I've tried the…
ziMtyth
0 votes, 1 answer

How to write an extracted image to a file object instead of to the file system?

I'm using the Python pdfminer library to extract both text and images from a PDF. Since the TextConverter class by default writes to sys.stdout, I used StringIO to catch the text as a variable as follows (see paste: def…
kramer65