Questions tagged [pdfminer]

A python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

  • Written entirely in Python. (for version 2.4 or newer)
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support. (well, almost)
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.

(source)

492 questions
5
votes
1 answer

Python pdfminer LAParams mixes text output

I have a PDF file and I want to parse text from it with pdfminer. The problem is that LAParams sometimes fails and gives some portion of the line at the end. I can't figure out why. My PDF looks like this: Output looks like this: My code is here, thanks in…
5
votes
1 answer

Is it possible to use regular expressions with pdfquery?

Can we use regex to detect text within a pdf (using pdfquery or another tool)? I know we can do this: pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf") pdf.load() label = pdf.pq('LTTextLineHorizontal:contains("Cash")') left_corner =…
Dayvid Oliveira
5
votes
0 answers

Is advanced PDF parsing doable with current software around?

We have a project that we are hoping to realize and in this project we need to deal with PDF files (unfortunately) and parsing their content. For the last few days we have been researching a lot about different libraries and we tried a few of those.…
ralzaul
5
votes
1 answer

PDF text extraction returns wrong characters due to ToUnicode map

I am trying to extract text from a foreign language PDF file using PDFMiner, but am being foiled by a ToUnicode statement. The file behaves strangely even under normal PDF viewers. For example, here is a screenshot from some text in the file: But…
pnj
5
votes
2 answers

Extracting tables from a pdf

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck but I can't really get the data from the tables. This is what one of the tables looks like: As you can see, some columns are marked with an…
user
4
votes
0 answers

How to fix - TypeError: int() argument must be a string, a bytes-like object or a number, not 'PSKeyword'?

I'm attempting to extract text from a PDF file using pdfminer and I am getting this problem, but only for some files. The code runs well on certain PDFs, but returns this error message for others. This is my code (which I've copied over from other…
Surya S
4
votes
0 answers

Get annotation text from its position (PDFMiner)

I want to extract the text of annotations (such as the highlighted text of hyperlinks) from their positions. For this, I could scrape the positions and URLs using PDFMiner, as in the code below. Is it possible to pass this position to a layout object…
alien
4
votes
1 answer

unexpected keyword argument 'codec' in XMLConverter

Below error message: device = XMLConverter(rsrcmgr, retstr, laparams=laparams, codec=codec) TypeError: __init__() got an unexpected keyword argument 'codec' Original Code: rsrcmgr = PDFResourceManager() retstr = BytesIO() codec = 'utf-8' laparams =…
Subash Nadar
4
votes
0 answers

LTRect and LTLine extraction - pdfminer

I am using pdfminer to extract line and rectangle objects in PDF documents, but for some PDFs the LTRect and LTLine objects are not getting identified even though there are lines and rectangles in them. Could you please suggest why these objects are not…
Anvitha
4
votes
0 answers

How to remove header and footer while extracting multiple page PDF to Text using PDFminer?

I've successfully extracted text from multiple-page PDFs, using PDFminer.six in Python, and converted it into a single string, but I would like to remove the header and footer of each page while extracting the PDF to text. So far similar questions…
Peter
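A common heuristic for the header and footer problem above is to drop any text whose vertical position falls inside a top or bottom margin of the page. The sketch below shows only that filtering step on plain tuples. It assumes each text box has already been reduced to `(y0, y1, text)` with y measured from the bottom of the page, as in PDF coordinates; `strip_margins`, the 50-point margin, and the sample boxes are all hypothetical.

```python
def strip_margins(boxes, page_height, margin=50.0):
    """Keep the text of boxes that lie fully between the bottom and top margins.

    boxes: iterable of (y0, y1, text), with y measured from the page bottom.
    """
    return [
        text
        for y0, y1, text in boxes
        if y0 >= margin and y1 <= page_height - margin
    ]


# Hypothetical boxes for one US Letter page (792 points tall).
boxes = [
    (760.0, 780.0, "Annual Report 2020"),  # header, inside the top margin
    (400.0, 420.0, "Body paragraph."),     # body text
    (20.0, 35.0, "Page 3"),                # footer, inside the bottom margin
]
body = strip_margins(boxes, page_height=792.0)
```

The margin would need tuning per document, since header and footer heights vary between PDFs.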
4
votes
1 answer

Detecting sections of a pdf with pdfminer

I am trying to transform PDFs from conference/journal papers into .txt files. I basically want to have a structure a bit cleaner than the current PDF: no line break before the end of a sentence and highlighting sections of the paper. The problem I…
LBes
4
votes
3 answers

Text Scraping a PDF with Python (pdfquery)

I need to scrape some PDF files to extract the following text information: I have attempted to do this using pdfquery, by working off an example I found on Reddit (see first post):…
Freya
4
votes
1 answer

Using pdfminer, code gets stuck on command interpretor.process_page(page), and never terminates or throws an error

I'm having some trouble with the PDFPageInterpreter in pdfminer. The below code has worked for me on every pdf file I've seen up till now, but I recently found out that when faced with a pdf page with an insane amount of text on it (like a condensed…
Malcoto
4
votes
2 answers

Convert text dump of a binary string into real string

A Python library gives me text-dumped binary UTF-8 strings, like this: In [1]: string Out[1]: "b'\\xd0\\x9f\\xd1\\x80\\xd0\\xb5\\xd0\\xb4\\xd0\\xb8\\xd1\\x81\\xd0\\xbb\\xd0\\xbe\\xd0\\xb2\\xd0\\xb8\\xd0\\xb5'" In [2]: type(string) Out[2]: str I…
krvkir
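For a string like the one in the question above, which holds the printed form of a bytes object, one possible approach is to parse it back into real bytes with `ast.literal_eval` and then decode it. This is a minimal sketch, assuming the dump is a well-formed Python bytes literal encoded as UTF-8; `undump_bytes` is a hypothetical helper name.

```python
import ast


def undump_bytes(dumped, encoding="utf-8"):
    """Turn the text of a printed bytes literal ("b'...'") back into a str."""
    value = ast.literal_eval(dumped)  # parses literals only, unlike eval()
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal")
    return value.decode(encoding)


# The str from the question: its content is the characters b'\xd0\x9f...'
dumped = "b'\\xd0\\x9f\\xd1\\x80\\xd0\\xb5\\xd0\\xb4\\xd0\\xb8\\xd1\\x81\\xd0\\xbb\\xd0\\xbe\\xd0\\xb2\\xd0\\xb8\\xd0\\xb5'"
text = undump_bytes(dumped)
```

`ast.literal_eval` is used here instead of `eval` because it only accepts literals, so it will not execute arbitrary code found in the input.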
4
votes
0 answers

Pyinstaller cannot load native module 'Crypto.Cipher.__raw_ecb'

When attempting to run my program, I receive this error from the command line: Traceback (most recent call last): File "cp file.py", line 16, in File "", line 971, in _find_and_load File "
user488476