Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text on a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
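As a minimal sketch of obtaining text locations, the following assumes the pdfminer.six fork and its high-level `extract_pages` function (the original PDFMiner exposes a lower-level API). The function names `iter_text_boxes` and `to_top_left` and the file name `example.pdf` are hypothetical, chosen only for illustration. Bounding boxes are reported as `(x0, y0, x1, y1)` with the origin at the bottom-left of the page, so a small helper converts them to top-left-origin coordinates, which most image libraries expect.

```python
def to_top_left(bbox, page_height):
    """Convert a bottom-left-origin bbox (x0, y0, x1, y1) to top-left-origin."""
    x0, y0, x1, y1 = bbox
    # The vertical axis is flipped, so the old top edge (y1) becomes the new y0.
    return (x0, page_height - y1, x1, page_height - y0)


def iter_text_boxes(pdf_path):
    """Yield (page_number, page_height, bbox, text) for each text container."""
    # Imported lazily so to_top_left() stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer):
                yield (page_number, page_layout.height,
                       element.bbox, element.get_text().strip())


# Usage (assumes a local file named example.pdf):
# for page_number, height, bbox, text in iter_text_boxes("example.pdf"):
#     print(page_number, to_top_left(bbox, height), repr(text))
```

The coordinate conversion is kept separate from the extraction so that it can be reused when drawing boxes onto a rendered page image.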

Features

  • Written entirely in Python. (for version 2.4 or newer)
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support. (well, almost)
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.


492 questions
0
votes
1 answer

How to use pdfMiner in Python to predictably read values

I've been using pdfMiner to read values off of graphs and so far it's been working great! However, there is one area in which the data is read correctly but in an unpredictable manner, meaning it will read all the graph's values correctly, in a…
Jeff
0
votes
2 answers

python - pull pdfs from webpage and convert to html

My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables. So…
maniciam
-1
votes
1 answer

How to visualize bounding boxes extracted from pdfminer.six?

I have a diagram in PDF format. I am using pdfminer.six to extract the text present in the diagram as well as the bounding boxes of the text. Everything is fine so far. System info: Windows 10, Python 3.9.13. Now I want to draw these bounding…
tintin98
-1
votes
1 answer

New error but no code changes! - TypeError: '<=' not supported

I am an inexperienced developer cobbling things together without deep knowledge, and I have come unstuck! I have a Google Cloud Function running some code that uses pdfminer.six to extract content from a PDF, and that has been working well. However I…
-1
votes
1 answer

Extracting images from a PDF using PyPDF2 - but the pdf has no metadata

The PDF is a scanned image, so I have not yet found a way to pull out the images. I have tried methods including crop and media boxes, but they pull the entire pages as images. I have also tried other parsing libraries like pdfminer.six, but…
-1
votes
2 answers

Extract metadata info from online pdf using pdfminer in python

I am interested in finding out some metadata of an online PDF using pdfminer. I would like to extract info such as title, author, number of lines, etc. from the PDF. I am trying to use a related solution discussed…
-1
votes
1 answer

I cannot find a way to extract underlined text; can't it be done with pdfminer.six?

I am trying to extract underlined text from a PDF using Python but have not been able to find a correct solution. Can anyone help with this, please?
-1
votes
1 answer

How to use pdfminer3 to iterate through multiple PDF files in a directory

I am trying to iterate through many PDF files to extract their text and place it into an Excel file. pdfminer3 has allowed me to do so with only one PDF file, but I am having trouble with iterating through many PDF files. from pdfminer3.layout…
Leeee
-1
votes
1 answer

How to flip a pdf page upside down using python?

I'm trying to flip PDF pages upside down using Python. I have tried multiple libraries like PyPDF2, PyMuPDF and pdfminer. There is documentation on how to rotate a page, but that is not what I'm looking for. The closest solution I found was on one…
Ajay Alex
-1
votes
1 answer

How to distinguish uploaded PDFs to extract data through regular expression in python Django

PDFs are uploaded here and converted into text. After the conversion, I use a regular expression to get some specific data from the PDFs. Now there are various kinds of PDFs and I have to use different types of regular expression for…
zenvar
-1
votes
1 answer

Extract text from multiple PDFs and write to a single CSV

I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I…
b00kgrrl
-1
votes
1 answer

Extract Numbers from a certain location in PDF files

I'm trying to write a script to extract numbers from the "Total Deviation" graph in pdf files that looks like this. The reason I am trying to extract the information from the location of the graph rather than parsing the whole file and filtering it…
Pen Gerald
-1
votes
2 answers

PDFMiner TypeError: not all arguments converted during string formatting

I have been trying to come up with a solution to parse a PDF into HTML so that I can later use Beautiful Soup to extract all the headings, subitems and paragraphs respectively in a tree structure. I have searched a few options available on the internet…
Ali Asad
-1
votes
1 answer

PDF miner, bad new line detection

I am using this code to get text data from PDF : def pdf_to_txt(path): manager = PDFResourceManager() retstr = BytesIO() layout = LAParams(all_texts=True) device = TextConverter(manager, retstr, laparams=layout) filepath =…
sygneto
-1
votes
1 answer

OSError "is not a valid Win32 application" when using PDFMiner

I am trying to import a bunch of PDFs and build a corpus. I try to use pdfminer but I get an OSError. MY CODE: import os BASE = os.path.join(r"C:\Users\dangeph\Desktop\DataScience\PDFMiner") DOCS = os.path.join(BASE, "data", "docs") def…
PDa