Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features

Written entirely in Python (for version 2.4 or newer).
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support (well, almost).
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.
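
As a rough illustration of "obtaining the exact location of text", here is a minimal sketch. It assumes the pdfminer.six fork (which several questions below use) rather than the original PDFMiner described above, and relies on its `extract_pages` function and `LTTextContainer` layout class. The names `TextBox`, `iter_text_boxes`, and `to_top_left` are hypothetical helpers introduced only for this example.

```python
from typing import Iterator, NamedTuple, Tuple


class TextBox(NamedTuple):
    """One block of text and its bounding box (hypothetical helper type)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every top-level text container in the PDF with its bounding box."""
    # Imported lazily so the pure helper below works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_no, x0, y0, x1, y1, element.get_text().strip())


def to_top_left(box: TextBox, page_height: float) -> Tuple[float, float, float, float]:
    """Convert a PDF box (origin bottom-left) to image-style (left, top, right, bottom)."""
    return (box.x0, page_height - box.y1, box.x1, page_height - box.y0)


# Usage sketch ("example.pdf" is a placeholder path):
# for box in iter_text_boxes("example.pdf"):
#     print(box.page, box.x0, box.y0, box.x1, box.y1, repr(box.text))
```

The coordinate flip is kept separate because PDF coordinates start at the bottom-left corner of the page, while most imaging libraries start at the top-left; the page height needed for the conversion would have to come from the page layout object.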

492 questions

votes

1 answer

How to use pdfMiner in python to predictably read values

I've been using pdfMiner to read values off of graphs and so far it's been working great! However, there is one area in which the data is read correctly but in an unpredictable manner, meaning it will read all the graphs' values correctly, in a…

python pdfminer pdf-manipulation

asked Dec 03 '14 at 06:58

Jeff

votes

2 answers

python - pull pdfs from webpage and convert to html

My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables. So…

python xpath scrapy pdf-extraction pdfminer

asked Feb 18 '14 at 21:06

maniciam

-1

votes

1 answer

How to visualize bounding boxes extracted from pdfminer.six?

I have a diagram in PDF format. I am using pdfminer.six to extract the text present in the diagram as well as the bounding boxes of the text. Everything is fine so far. System info: Windows 10, Python 3.9.13. Now I want to draw these bounding…

python opencv pdf image-processing pdfminer

asked Jul 31 '23 at 18:21

tintin98

-1

votes

1 answer

New error but no code changes! - TypeError: '<=' not supported

I am an inexperienced developer cobbling things together with no deep knowledge, and I have come unstuck! I have a Google Cloud Function running some code that uses pdfminer.six to extract content from a PDF and that has been working well. However I…

python python-3.x google-cloud-functions pdfminer

asked Mar 24 '23 at 10:06

Phil Wakefield

-1

votes

1 answer

Extracting images from a PDF using PyPDF2 - but the pdf has no metadata

The PDF is a scanned image, so there is no way I have found yet, to pull out the images. I have tried methods including crop and media boxes, but it pulls the entire pages as images. I have also tried other parsing libraries like pdfminer.six, but…

pypdf pdfminer

asked Mar 10 '23 at 16:47

olivia harman

-1

votes

2 answers

Extract metadata info from online pdf using pdfminer in python

I am interested in finding out some metadata of an online PDF using pdfminer. I am interested in extracting info such as title, author, number of lines, etc. from the PDF. I am trying to use a related solution discussed…

python web-scraping pdfminer pdf-scraping

asked Feb 28 '23 at 11:24

Bitopan Gogoi

-1

votes

1 answer

I cannot find a way to extract underlined text, can't it be done with pdfminer.six?

I am trying to extract underlined text from a PDF using Python, but I have not been able to find a correct solution. Can anyone help with this, please?

python pdf pypdf pdfminer pdfplumber

asked Jul 16 '21 at 12:04

ram gengadar

-1

votes

1 answer

How to use pdfminer3 to iterate through multiple PDF files in a directory

I am trying to iterate through many PDF files to extract their text and place them into an excel file. pdfminer3 has allowed me to do so with only one PDF file but I am having trouble with iterating through many PDF files. from pdfminer3.layout…

python-3.x for-loop pdfminer

asked May 22 '21 at 18:54

Leeee

-1

votes

1 answer

How to flip a pdf page upside down using python?

I'm trying to flip PDF pages upside down using Python. I have tried multiple libraries like PyPDF2, PyMuPDF and pdfminer. There is documentation on how to rotate a page, but that is not what I'm looking for. The closest solution I found was on one…

python pdf pypdf pdfminer pymupdf

asked Aug 23 '20 at 10:09

Ajay Alex

-1

votes

1 answer

How to distinguish uploaded PDFs to extract data through regular expression in python Django

Here are uploaded pdfs and it will convert it into text. After converting into text I use a regular expression to get some specific data from the pdfs. Now there are various kinds of pdfs and I have to use different types of regular expression for…

python django pdf pdfminer pdf-extraction

asked Apr 15 '20 at 21:46

zenvar

-1

votes

1 answer

Extract text from multiple PDFs and write to a single CSV

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I…

python pandas pdf text-extraction pdfminer

asked Feb 29 '20 at 12:39

b00kgrrl

-1

votes

1 answer

Extract Numbers from a certain location in PDF files

I'm trying to write a script to extract numbers from the "Total Deviation" graph in pdf files that looks like this. The reason I am trying to extract the information from the location of the graph rather than parsing the whole file and filtering it…

python pdf pdfminer

asked Dec 14 '19 at 07:39

Pen Gerald

-1

votes

2 answers

PDFMiner TypeError: not all arguments converted during string formatting

I have been trying to come up with a solution to parse a PDF into an HTML so, later I'll use beautiful soup to extract all the headings, subitems and paragraph respectively in a tree structure. I have searched a few options available on the internet…

python python-3.x pdfminer

asked Sep 24 '19 at 13:12

Ali Asad

-1

votes

1 answer

PDF miner, bad new line detection

I am using this code to get text data from PDF : def pdf_to_txt(path): manager = PDFResourceManager() retstr = BytesIO() layout = LAParams(all_texts=True) device = TextConverter(manager, retstr, laparams=layout) filepath =…

python pdfminer

asked Aug 26 '19 at 07:01

sygneto

-1

votes

1 answer

OSError "is not a valid Win32 application" when using PDFMiner

I am trying to import a bunch of PDFs and build a corpus. I try to use pdfminer but I get an OSError. MY CODE: import os BASE = os.path.join(r"C:\Users\dangeph\Desktop\DataScience\PDFMiner") DOCS = os.path.join(BASE, "data", "docs") def…

python pdfminer

asked May 20 '19 at 20:08

PDa
