Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
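
As a rough illustration of working with text locations, the sketch below filters text boxes by a bounding rectangle. It assumes the pdfminer.six fork is installed (the original PDFMiner API differs), and that layout objects expose a `bbox` tuple of `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page. The names `bbox_inside` and `text_in_region` are hypothetical helpers introduced here, not part of the library.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page.
BBox = Tuple[float, float, float, float]


def bbox_inside(inner: BBox, outer: BBox) -> bool:
    """Return True if `inner` lies entirely within `outer` (edges inclusive)."""
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def text_in_region(pdf_path: str, region: BBox) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bbox, text) for each text box fully inside `region`.

    Assumption: pdfminer.six is installed. It is imported lazily so that
    `bbox_inside` stays usable without the library.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer) and bbox_inside(element.bbox, region):
                yield page_number, element.bbox, element.get_text()


# Hypothetical usage ("example.pdf" and the region are placeholders):
# for page, bbox, text in text_in_region("example.pdf", (0, 0, 300, 400)):
#     print(page, bbox, text.strip())
```

Requiring a box to sit fully inside the region is one possible choice; an overlap test may suit documents where text boxes straddle the region boundary.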

Features

  • Written entirely in Python (version 2.4 or newer).
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.


492 questions
6
votes
3 answers

Extract pdf text within bounding box directly into python

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so far I've only gotten code for pdftotext to extract…
Evan Mata
6
votes
1 answer

PyPDF2 to extract vertical text from scanned pdf

I am trying to extract text from a scanned PDF using PyPDF2. Some of the PDFs contain vertically aligned text, but the page orientation is portrait. Is there any way to identify if the text is vertically aligned and read vertical lines in…
Mms
6
votes
1 answer

How to use pdfminer.six's pdf2txt.py in python script and outside command line?

I know how to use pdfminer.six's pdf2txt.py tool from the command line; however, I have many PDF files to convert to txt files and I can't just do it one by one on the command line. I haven't found how to use this tool in an actual Python script. Any ideas?
Ashley Liu
6
votes
2 answers

How to extract tables from a pdf with PDFMiner?

I am trying to extract information from some tables in a pdf document. Consider the input: Title 1 some text some text some text some text some text some text some text some text some text some text Table Title | Col1 | Col2 | Col3 …
AbtPst
6
votes
2 answers

PDFminer empty output

While processing a file with pdfminer (pdf2txt.py) I received empty output: dan@work:~/project$ pdf2txt.py docs/homericaeast.pdf dan@work:~/project$ Can anybody say what is wrong with this file and what I can do to get data from it? Here's…
Danil
6
votes
3 answers

Python pdfminer extract image produces multiple images per page (should be single image)

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin). I am…
Erik
6
votes
1 answer

Python PDFMIner - PDF to CSV

I want to be able to convert PDFs to CSV files and have found several useful scripts but, being new to Python, I have a question: Where do you specify the filepath of the PDF and the CSV you want to print to? I'm using Python 2.7.11 and PDFMiner…
HB123
6
votes
1 answer

I want to scrape a Hindi (Indian language) PDF file with Python

I have written Python code that scrapes all the data from the PDF file. The problem is that once it is scraped, the words lose their grammar. How can I fix this problem? I am attaching the code. from pdfminer.pdfinterp import PDFResourceManager,…
6
votes
3 answers

Warnings on pdfminer

I found and (slightly) modified this script on Stack Overflow so that it works on Python 3.3: from pdfminer.pdfinterp import PDFResourceManager, process_pdf from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from io…
rodrigocf
6
votes
2 answers

pdfminer3k has no method named create_pages in PDFPage

Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have edited everything. Their change logs do not reflect the changes they have made, and I had no success in parsing a PDF with pdfminer3k. For…
Jack_of_All_Trades
6
votes
1 answer

Can I use python's pdfminer to extract highlights from a pdf?

I wanted to try to extract highlighted text from a pdf, so I started looking at pdfminer but could not find any documentation for this specific function. Is this possible at all?
magicrebirth
6
votes
0 answers

pdfminer/poppler - how to set encoding

I have a file, i.e. http://www.agfl.cs.ru.nl/papers/manual28.pdf (it's in English). pdfminer and poppler show the same result for most parsed pages, like: ¾º¿  ÒÙ Öݸ ¾¼¼ Ⱥ ¾º ÂÙÒ ¸ ¾¼¼ ź Ë ÙØØ Ö¸ Ǻ Ë It seems they can't read custom font encodings.…
night-crawler
6
votes
2 answers

python PDFminer only parses part of the page

I am parsing a PDF document using the pdfminer Python module. I just want to extract text from this document. The process is going well, but when I extract LTText* objects, I realize that I am not getting all the text inside that LTText* object.…
juankysmith
5
votes
2 answers

Python PDF Parsing with Camelot and Extract the Table Title

Camelot is a fantastic Python library to extract the tables from a pdf file as a data frame. However, I'm looking for a solution that also returns the table description text written right above the table. The code I'm using for extracting tables…
Ali Asad
5
votes
5 answers

Python PDF read straight across as how it looks in the PDF

If I use the code in the answer here: Extracting text from a PDF file using PDFMiner in python? I can get the text to extract when applying to this pdf: https://www.tencent.com/en-us/articles/15000691526464720.pdf However, you see under…
jason