Questions tagged [pymupdf]

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”. MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats. NOTE: It is imported in Python as fitz.

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats.

These are files with extensions .pdf, .xps, .oxps, .cbz, .fb2 or .epub (so you can develop e-book viewers in Python).

PyMuPDF provides access to many important functions of MuPDF from within a Python environment.

Note on the Name fitz:

The standard Python import statement for this library is import fitz. This has a historical reason.

257 questions
1
vote
1 answer

PyMuPDF Bookmarks

I have a script that combines a bunch of PDFs into a single file using PyPDF2. It works, but on the company network it is really slow. I then tried PyMuPDF and it is 100 times faster, but bookmarks and metadata are not copied automatically. Is there an…
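For the merge question above, one possible approach is to rebuild the bookmarks by hand: collect each source file's table of contents and shift its page numbers by the number of pages already merged. The sketch below shows only that offset logic in plain Python. It assumes each bookmark has the simple `[level, title, page]` form with 1-based page numbers, which is the shape `Document.get_toc()` returns in PyMuPDF as far as I know; `merge_tocs` and the sample data are hypothetical.

```python
def merge_tocs(parts):
    """Combine bookmark lists from several files into one.

    parts: iterable of (toc, page_count) pairs in merge order, where toc is
    a list of [level, title, page] entries with 1-based page numbers.
    """
    merged = []
    offset = 0  # number of pages contributed by the files merged so far
    for toc, page_count in parts:
        for level, title, page in toc:
            # Entries without a valid target page (page < 1) are kept unchanged.
            new_page = page + offset if page > 0 else page
            merged.append([level, title, new_page])
        offset += page_count
    return merged


# Hypothetical input: a 3-page file followed by a 2-page file.
parts = [
    ([[1, "Intro", 1], [2, "Scope", 2]], 3),
    ([[1, "Appendix", 1]], 2),
]
merged_toc = merge_tocs(parts)
print(merged_toc)
```

The resulting list could then presumably be written to the merged document with `Document.set_toc()`; metadata would need to be copied separately.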
1
vote
1 answer

List matches of page.search_for() with PyMuPDF

I'm writing a script to highlight text from a list of quotes in a PDF. The quotes are in the list text_list. I use this code to highlight the text in the PDF: import fitz #Load Document doc = fitz.open(filename) #Iterate over pages for page in…
SamVimes
  • 39
  • 7
1
vote
1 answer

How to extract the anchor text/words of every hyperlink in a PDF using Python?

I am trying to extract the hyperlinks on each page, together with their anchor text, from a PDF using the PyMuPDF library. I am able to extract the hyperlinks with their page numbers, but I am not able to extract the anchor text/words for each hyperlink. Can anyone help…
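One way to approach the anchor-text question is geometric: a link is just a rectangle on the page, so its anchor text is whatever words overlap that rectangle. The sketch below is plain Python and makes two assumptions: words are tuples shaped like `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, as `page.get_text("words")` yields them, and the link rectangle is the value under the `"from"` key of an entry from `page.get_links()`. The helper names and sample coordinates are made up for illustration.

```python
def rects_intersect(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) overlap with non-zero area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def anchor_text(link_rect, words):
    """Join the words whose rectangles overlap the link rectangle."""
    hits = [w for w in words if rects_intersect(link_rect, w[:4])]
    # Keep reading order: block number, then line number, then word number.
    hits.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in hits)


# Hypothetical page content: only "our site" lies under the link rectangle.
words = [
    (10, 10, 40, 20, "Visit", 0, 0, 0),
    (45, 10, 70, 20, "our", 0, 0, 1),
    (75, 10, 120, 20, "site", 0, 0, 2),
    (10, 30, 50, 40, "today", 0, 1, 0),
]
link_rect = (44, 9, 121, 21)
print(anchor_text(link_rect, words))  # our site
```

With a real document, `page.get_textbox(rect)` may give a similar result in one call; the manual version makes it easier to tune how much overlap counts as a hit.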
1
vote
0 answers

Maintaining the sequence of the extracted text and images from a PDF while scraping them in Python

I am trying to extract text and images from a PDF in Python using the PyMuPDF library. Unfortunately, I can't preserve the position of the images in the sequence. For example, an image is placed at the start of the page, but while extracting it, the image is…
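A possible way to keep text and images in page order is to treat both as blocks with bounding boxes and sort them by position. The sketch below assumes block dictionaries with a `"type"` key (0 for text, 1 for image) and a `"bbox"` key, which is the shape of the entries in `page.get_text("dict")["blocks"]`; the `"label"` key and the sample coordinates are hypothetical and exist only to make the output readable.

```python
def in_reading_order(blocks):
    """Sort blocks top to bottom, then left to right, by their bounding boxes."""
    # bbox is (x0, y0, x1, y1); rounding y0 keeps blocks that start at almost
    # the same height together before the left-to-right comparison.
    return sorted(blocks, key=lambda b: (round(b["bbox"][1]), b["bbox"][0]))


# Hypothetical blocks, listed in the "wrong" order on purpose.
blocks = [
    {"type": 0, "bbox": (50, 200, 500, 260), "label": "paragraph"},
    {"type": 1, "bbox": (50, 40, 300, 180), "label": "image"},
    {"type": 0, "bbox": (320, 40, 500, 60), "label": "caption"},
]
ordered = [b["label"] for b in in_reading_order(blocks)]
print(ordered)
```

This simple sort works for single-column layouts; multi-column pages would need the blocks grouped by column first.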
1
vote
1 answer

Extract all Images from PDF with Python, and retain their transparency

I see a number of solutions on the web and here for extracting images from a PDF with PyMuPDF, PyPDF2, and others, but they either fail to retain transparency information or use deprecated code that no longer works, or the questions have gone…
Chris Valentine
  • 1,557
  • 1
  • 19
  • 36
1
vote
1 answer

Highlight numbers in a PDF using Python

I was able to highlight words in a PDF (using the below code). However, I would also like to highlight any number contained in the same PDF. How would you complement such code? import fitz # opening the pdf file my_pdf =…
CelloRibeiro
  • 160
  • 11
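For the number-highlighting question, a regular expression over the page's words is one option. The sketch below only selects the words that look like numbers; it assumes word tuples of the form `(x0, y0, x1, y1, text, block_no, line_no, word_no)` as produced by `page.get_text("words")`, and the pattern and helper name are illustrative choices, not part of the original code.

```python
import re

# Optional sign, digits, and optional groups such as ",250" or ".5".
NUMBER = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")


def number_words(words):
    """Return the word tuples whose text is a number, ignoring edge punctuation."""
    return [w for w in words if NUMBER.match(w[4].strip(".,;:()%$"))]


# Hypothetical words from one line of a page.
words = [
    (0, 0, 10, 10, "Total:", 0, 0, 0),
    (12, 0, 30, 10, "1,250", 0, 0, 1),
    (32, 0, 40, 10, "items", 0, 0, 2),
    (42, 0, 60, 10, "(3.5%)", 0, 0, 3),
]
found = [w[4] for w in number_words(words)]
print(found)
```

Each selected tuple carries its rectangle in the first four fields, so it could be passed to `page.add_highlight_annot()` in the same way as the word matches.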
1
vote
2 answers

python pymupdf - How to write something into a pdf form field (widget)

I'm using PyMuPDF and just trying to write some text into an already existing PDF form field (widget). I was able to identify the widget by its xref, and read its contents, but I don't know how to modify its field_value and save it back. I've tried…
Max Iskram
  • 147
  • 10
1
vote
1 answer

Installing PyMuPDF in Python 3.8 Alpine

I am trying to install PyMuPDF in the official Python 3.8 Alpine Docker image. The Dockerfile is like this: FROM python:3.8-alpine RUN apk add --update --no-cache \ gcc g++ \ libc-dev \ python3-dev \ build-base \ cairo-dev \ …
Raiyan
  • 1,589
  • 1
  • 14
  • 28
1
vote
0 answers

How to press a button on a PDF form with Python?

I have a situation where I need to fill out a PDF form and then press a button in it (I need to press "Send" button and this sends the data to the system). From what I understand, pressing the button executes a JavaScript script on the form. I'm…
1
vote
0 answers

How does one get the exact coordinates of text after running PyMuPDF's search_for?

Suppose I run PyMuPDF's search_for function: import fitz doc = fitz.Document(pdf_path) page = doc[pg] coords = page.search_for('foo', quads=True) First off, is this guaranteed to be the exact, minimal bounding rectangle of the text matched? -- I…
Chris
  • 28,822
  • 27
  • 83
  • 158
1
vote
1 answer

How to extract data from unstructured PDFs using PyMuPDF in Python?

I am following this guide on how to extract data from unstructured PDFs using PyMuPDF. https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/ I am getting an AttributeError: 'NoneType' object has no attribute 'rect'…
shuynh84
  • 59
  • 8
1
vote
1 answer

How to extract only a certain table from a PDF (invoice) that contains multiple tables in a structured format

How can I extract only one table from a PDF that contains multiple tables? I have tried using Amazon Textract, but the problem is that it gives me all the tables in that PDF as a CSV. But I need to extract only certain tables based on some…
Jyoti yadav
  • 108
  • 6
1
vote
1 answer

How to get a file path using tkinter askopenfilename or other command?

I'm building a simple app that converts PDF to PNG. When I use: pdf_name = askopenfilenames(initialdir="/", title="Selecionar Arquivos") I get: print(pdf_name) ('C:/Users/user/Desktop/Apps/Python/Conversor img to pdf/file.pdf',) So, the ask…
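The printed value above is a tuple because `askopenfilenames` (plural) returns every selected file, while `askopenfilename` returns a single string. A small normalizing helper, sketched below, handles both cases as well as a cancelled dialog; the function name `selected_paths` is hypothetical, and the tuple is the one shown in the question.

```python
from pathlib import Path


def selected_paths(selection):
    """Turn a file dialog result into a list of Path objects.

    Accepts a tuple of strings (askopenfilenames), a single string
    (askopenfilename), or an empty value when the dialog was cancelled.
    """
    if not selection:
        return []
    if isinstance(selection, str):
        return [Path(selection)]
    return [Path(p) for p in selection]


selection = ('C:/Users/user/Desktop/Apps/Python/Conversor img to pdf/file.pdf',)
paths = selected_paths(selection)
print(paths[0].name)  # file.pdf
```

Each element of the list can then be converted to a string with `str(path)` if the PDF library expects a plain file name.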
1
vote
0 answers

Convert PDF to HTML via PyMuPDF

For pages with tabular data in landscape format, the words in the HTML output overlap. For pages in portrait format, the conversion is successful. Any ideas how to fix that? [Here is an example with the converted pdf to html in landscape…
1
vote
1 answer

Case-sensitive PDF highlighting using PyMuPDF and re

The goal is a program that can take a PDF of a script as well as the name of a character and output a script with only that character's lines (or at least their name) highlighted. An example of the way these scripts are typically formatted: Here I…
deep_node
  • 23
  • 4
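For the case-sensitive highlighting question, note that `page.search_for()` ignores case as far as I know, so matching on extracted words is one way around it. The sketch below assumes word tuples of the form `(x0, y0, x1, y1, text, block_no, line_no, word_no)` as returned by `page.get_text("words")`; the character name and coordinates are invented, and the helper only handles single-word names.

```python
import string


def exact_case_hits(words, name):
    """Return rectangles of words equal to name, compared case-sensitively."""
    # Punctuation at the edges ("ANNA:" or "Anna,") is ignored for the comparison.
    return [w[:4] for w in words if w[4].strip(string.punctuation) == name]


# Hypothetical words: the speaker label is upper case, the mention in dialogue is not.
words = [
    (72, 100, 110, 112, "ANNA", 0, 0, 0),
    (72, 120, 100, 132, "Anna,", 1, 0, 0),
    (105, 120, 130, 132, "wait", 1, 0, 1),
    (72, 140, 112, 152, "ANNA:", 2, 0, 0),
]
hits = exact_case_hits(words, "ANNA")
print(hits)
```

The returned rectangles could then be highlighted one by one, leaving the lower-case mentions inside other characters' lines untouched.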