Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

Written entirely in Python. (for version 2.4 or newer)
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support. (well, almost)
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than C/C++-based counterparts such as XPdf.
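
As a rough illustration of "obtaining the exact location of text", here is a minimal sketch. It is not taken from the PDFMiner documentation: it assumes the pdfminer.six fork is installed (for its `extract_pages` helper and `LTTextContainer` layout class), and the names `TextBox`, `iter_text_boxes`, and `inside` are hypothetical helpers made up for this example.

```python
from typing import Iterator, NamedTuple, Tuple


class TextBox(NamedTuple):
    """One block of text plus its bounding box in PDF points (origin at bottom-left)."""
    page: int  # 1-based page number
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every text container found by the layout analysis, page by page."""
    # Assumption: pdfminer.six is installed. Imported lazily so that the
    # pure-Python helper below works even without it.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_number, x0, y0, x1, y1, element.get_text())


def inside(box: TextBox, region: Tuple[float, float, float, float]) -> bool:
    """Return True if `box` lies entirely within region = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    return rx0 <= box.x0 and ry0 <= box.y0 and box.x1 <= rx1 and box.y1 <= ry1
```

Something like `[b.text for b in iter_text_boxes("example.pdf") if inside(b, (0, 0, 300, 400))]` would then collect the text within a rectangle; the file name and coordinates here are placeholders.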


492 questions

3 answers

Extract pdf text within bounding box directly into python

I'm trying to extract the text of a pdf within a given bounding rectangle. I understand there are tools for pdf scraping such as pdfminer, pypdf, and pdftotext. I've experimented with all 3, and so far I've only gotten code for pdftotext to extract…

asked Apr 09 '19 at 00:26

Evan Mata

1 answer

PyPDF2 to extract vertical text from scanned pdf

I am trying to extract text from a scanned PDF using PyPDF2. Some of the PDFs contain text aligned vertically, but the page orientation is portrait. Is there any way to identify if the text is vertically aligned and read vertical lines in…

python python-3.x pypdf pdfminer pdf-extraction

asked Sep 27 '18 at 05:53

Mms

1 answer

How to use pdfminer.six's pdf2txt.py in python script and outside command line?

I know how to use pdfminer.six's pdf2txt.py tool from the command line; however, I have many PDF files to convert to txt files and I can't just do it one by one on the command line. I haven't found how to use this tool in an actual Python script. Any ideas?

python python-3.x python-3.6 pdfminer

asked Sep 20 '18 at 01:31

Ashley Liu

2 answers

How to extract tables from a pdf with PDFMiner?

I am trying to extract information from some tables in a pdf document. Consider the input: Title 1 some text some text some text some text some text some text some text some text some text some text Table Title | Col1 | Col2 | Col3 …

python parsing pdf pdfminer

asked Sep 14 '17 at 15:20

AbtPst

2 answers

PDFminer empty output

While processing a file with pdfminer (pdf2txt.py) I received empty output: dan@work:~/project$ pdf2txt.py docs/homericaeast.pdf dan@work:~/project$ Can anybody say what is wrong with this file and what I can do to get data from it? Here's…

python pdf pdfminer pdf-parsing

asked May 07 '17 at 14:10

Danil

3 answers

Python pdfminer extract image produces multiple images per page (should be single image)

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin). I am…

python-2.7 pdfminer

asked Jul 11 '16 at 22:41

Erik

1 answer

Python PDFMiner - PDF to CSV

I want to be able to convert PDFs to CSV files and have found several useful scripts but, being new to Python, I have a question: Where do you specify the filepath of the PDF and the CSV you want to print to? I'm using Python 2.7.11 and PDFMiner…

python csv pdf pdfminer

asked Apr 27 '16 at 23:10

HB123

1 answer

I want to scrape a Hindi (Indian language) PDF file with Python

I have written Python code that scrapes all the data from the PDF file. The problem is that once it is scraped, the words lose their grammar. How do I fix this problem? I am attaching the code. from pdfminer.pdfinterp import PDFResourceManager,…

python pdf ocr pdfminer pdf-scraping

asked Mar 14 '16 at 18:50

Abhinav Mishra

3 answers

Warnings on pdfminer

I found this script on Stack Overflow and (slightly) modified it to work on Python 3.3: from pdfminer.pdfinterp import PDFResourceManager, process_pdf from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from io…

python pdf python-3.x pdfminer

asked Apr 21 '15 at 04:05

rodrigocf

2 answers

pdfminer3k has no method named create_pages in PDFPage

Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have edited everything. Their change logs do not reflect the changes they have made, and I had no success in parsing PDFs with pdfminer3k. For…

python pdfminer

asked Oct 16 '14 at 20:21

Jack_of_All_Trades

1 answer

Can I use python's pdfminer to extract highlights from a pdf?

I wanted to try to extract highlighted text from a pdf, so I started looking at pdfminer but could not find any documentation for this specific function. Is this possible at all?

python pdf pdfminer

asked Aug 13 '14 at 11:42

magicrebirth

0 answers

pdfminer/poppler - how to set encoding

I have a file, i.e. http://www.agfl.cs.ru.nl/papers/manual28.pdf (it's in English). Pdfminer and poppler show the same result for most parsed pages, like: ¾º¿ Â ÒÙ ÖÝ¸ ¾¼¼ Èº ¾º ÂÙÒ ¸ ¾¼¼ Åº Ë ÙØØ Ö¸ Çº Ë It seems they can't read custom font encodings.…

python encoding poppler pdfminer

asked Feb 06 '14 at 07:56

night-crawler

2 answers

python PDFminer only parses part of the page

I am parsing a PDF document using the pdfminer Python module. I just want to extract text from this document. The process is going great but, when I extract LTText* objects, I realize that I am not getting all the text inside that LTText* object.…

python parsing pdf pdfminer

asked Nov 07 '13 at 10:11

juankysmith

2 answers

Python PDF Parsing with Camelot and Extract the Table Title

Camelot is a fantastic Python library to extract the tables from a pdf file as a data frame. However, I'm looking for a solution that also returns the table description text written right above the table. The code I'm using for extracting tables…

python pdfminer tabula python-camelot

asked Oct 01 '19 at 13:04

Ali Asad

5 answers

Python PDF read straight across as how it looks in the PDF

If I use the code in the answer here: Extracting text from a PDF file using PDFMiner in python? I can get the text to extract when applying to this pdf: https://www.tencent.com/en-us/articles/15000691526464720.pdf However, you see under…

python pdf pdfminer pypdf

asked Jul 21 '18 at 21:47

jason
