Questions tagged [pymupdf]

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”. MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats. NOTE: It is imported in Python as fitz.

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats.

These are files with extensions .pdf, .xps, .oxps, .cbz, .fb2 or .epub (so you can develop e-book viewers in Python).

PyMuPDF provides access to many important functions of MuPDF from within a Python environment.

Note on the Name fitz:

The standard Python import statement for this library is import fitz. This has a historical reason.

257 questions
0
votes
0 answers

How to add a hyperlink next to the word in PDF and create a new pdf with hyperlinks appended

    import fitz  # PyMuPDF library

    # Load the PDF document
    pdf_path = "./FAQ.pdf"
    pdf_document = fitz.open(pdf_path)

    # Initialize a dictionary to store text and hyperlink pairs
    text_with_links = {}

    # Iterate through the pages of the PDF document
    for…
nithin
  • 753
  • 3
  • 7
  • 21
0
votes
0 answers

How to retrieve page numbers of the TOC (table of contents) from a PDF in Python

I have managed to retrieve the number of the page where the TOC (table of contents) starts in a PDF. This works great if the TOC is exactly one page long. But now I am unable to come up with any good logic if there is a multi-page TOC…
0
votes
1 answer

Python - Fitz PDF Skimmer - Question on how to return sentences with keywords

I'm in the process of creating a PDF skimmer that reads a legal document, searches for keywords, returns the individual sentences that the keywords are a part of, then updates a checklist based on the conditions of the returned sentences. All the…
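
One possible way to approach the sentence step, as a minimal sketch: it assumes the page text has already been extracted into a plain string, and it splits sentences naively on `.`, `!` or `?` followed by whitespace. The function name `sentences_with_keywords` and the sample text are hypothetical and not taken from the question.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_keywords(text, keywords):
    """Map each keyword to the sentences that contain it (case-insensitive)."""
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    normalized = " ".join(text.split())
    sentences = _SENTENCE_END.split(normalized)

    # Whole-word match so that "rent" does not match "current".
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    hits = {kw: [] for kw in keywords}
    for sentence in sentences:
        for kw, pattern in patterns.items():
            if pattern.search(sentence):
                hits[kw].append(sentence)
    return hits


sample = (
    "The tenant shall pay rent monthly. The landlord may\n"
    "terminate the lease. Notices must be written."
)
result = sentences_with_keywords(sample, ["rent", "lease"])
```

Legal text is full of abbreviations such as "Sec. 3" or "No. 12", which this splitter would cut in the wrong place, so a real skimmer would likely need a more careful sentence tokenizer.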
0
votes
1 answer

Struggling to get fitz text extraction to work when passing a clip argument

I'm currently writing a Python script to convert PDFs to audiobooks and I'm trying to use a border to remove page numbers and other unwanted titles. Here is my current code for this (gTTS will be changed to a better library eventually): import…
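
For reference, a hedged sketch of how a clip rectangle could be used to skip header and footer bands. It assumes the unwanted text sits within fixed margins at the top and bottom of every page; the 50-point defaults and the names `body_rect` and `extract_body_text` are illustrative assumptions, not the asker's code. PyMuPDF page coordinates start at the top-left corner with y growing downward.

```python
def body_rect(page_width, page_height, top_margin=50.0, bottom_margin=50.0):
    """Return (x0, y0, x1, y1) for the page area without header/footer bands."""
    if top_margin + bottom_margin >= page_height:
        raise ValueError("margins leave no room for body text")
    return (0.0, top_margin, page_width, page_height - bottom_margin)


def extract_body_text(pdf_path, top_margin=50.0, bottom_margin=50.0):
    """Extract text from every page, ignoring the top and bottom margins."""
    # Imported here so body_rect() stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            clip = fitz.Rect(
                body_rect(page.rect.width, page.rect.height,
                          top_margin, bottom_margin)
            )
            # Only text inside the clip rectangle is returned.
            parts.append(page.get_text("text", clip=clip))
    return "\n".join(parts)
```

If the clipped result comes back empty, it is worth printing `page.rect` and the clip rectangle to check that they actually overlap.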
0
votes
0 answers

Permission denied when writing content created with PyMuPDF to temporary PDF file

I'm working on a Python script that uses the PyMuPDF library to modify a PDF document and then save the modified content to a temporary PDF file. However, I'm encountering a "Permission denied" error when trying to write the content of the PDF file…
Mazze
  • 383
  • 3
  • 13
0
votes
0 answers

Is there any way to compress a large PDF file using only a Python library, with no external .exe?

I have a 100 MB single-page PDF file with colorful text in different fonts and multiple images. Is there a way to compress this PDF to a minimum size and then decompress it back to its original size with the same image and text quality, but using only Python and…
Kedar17
  • 178
  • 2
  • 14
0
votes
1 answer

How to save a fitz.Page as bytes (later to be uploaded to azure blob)?

I have a fitz.fitz.Page object that I want to save as bytes so that I can later upload it to blob storage, and I have been unable to find out how to do this in the fitz documentation. I do have the code to upload bytes to the blob storage, though, and just need…
newbie101
  • 65
  • 7
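
A possible approach, sketched under the assumption that "bytes" means a standalone one-page PDF: copy the page into a new empty document and serialize that document in memory. The helper name `page_to_pdf_bytes` is hypothetical, and the upload step is left out because it depends on the storage client.

```python
def page_to_pdf_bytes(page):
    """Return a one-page PDF, as bytes, containing a copy of `page`."""
    # Imported here so the sketch only needs PyMuPDF when it is called.
    import fitz  # PyMuPDF

    source = page.parent           # the Document that owns this page
    single = fitz.open()           # new, empty in-memory PDF
    single.insert_pdf(source, from_page=page.number, to_page=page.number)
    data = single.tobytes()        # serialize without touching the disk
    single.close()
    return data
```

The returned value is an ordinary `bytes` object, so it can be handed to whatever upload call already works for other byte payloads.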
0
votes
0 answers

Python - Handling text anomalies from .pdf files for NLP

I need to automate a cleaning procedure for text loaded from a .pdf file. This is what the issue is currently about, and here's the code I'm using to clean the PDF: def clean_text(text): # Remove additional whitespaces and newlines using regex …
Vandalism
  • 23
  • 4
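
As a rough illustration of what such a cleaning step might look like (a sketch only, since the asker's actual `clean_text` body is truncated above): it assumes the typical anomalies are ligature characters, soft hyphens, words hyphenated across line breaks, and runs of spaces or blank lines.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize common PDF text-extraction artifacts (illustrative only)."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature U+FB01.
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens, which are invisible but break tokenization.
    text = text.replace("\u00ad", "")
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces hugging a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


cleaned = clean_text("The \ufb01rst  exam-\nple   text.\n\n\n\nNext  para. ")
```

The hyphen rule also merges genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"), so it may need a dictionary check for NLP work where that matters.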
0
votes
1 answer

Python requirements.txt missing package after running pipreqs

I'm using PyMuPDF in a Flask application and also in some standalone scripts. I'm trying to update my requirements.txt to include the proper PyMuPDF package I'm using, but using the Context Action in PyCharm, the Sync requirements.txt option in the…
DFW
  • 805
  • 1
  • 8
  • 18
0
votes
1 answer

Orientation issue in PDF with OCRmyPDF and AWS Textract

I have working code that uses AWS Textract to perform OCR in PDFs, and generally have no issues with alignment. But in a recent test document, the redactions performed show up exactly 90 degrees rotated in relation to the PDF image. So far I've been…
REJ
  • 1
  • 2
0
votes
0 answers

Text is still visible when hovering over the masked position after masking with PyMuPDF

I am masking an email ID in a PDF using PyMuPDF. When I open the file, no hyperlink is visible under the email ID, but when the text is extracted, a hyperlink is found. Because of that, after I mask the email ID and hover over the mask, I am still able to see the email ID. Had…
0
votes
0 answers

Insert text into a PDF using fonts already used in the PDF with Python PyMuPDF

I am trying to insert text into a PDF using fonts already used in that PDF.

    $ import fitz
    $ doc = fitz.open('input.pdf')
    $ page = doc[0]
    $ doc.extract_font() -> ('invalid-name', '', '', b'')
    $ doc.get_page_fonts(0) -> [(6, 'ttf', 'Type0',…
0
votes
1 answer

How to edit pdf from azure blob storage without downloading it locally? (using Fitz)

I have a PDF that is already in the blob storage. I need to highlight a few lines in it and store it as a new PDF (again in blob storage). I tried finding it in the links below but couldn't. Below is the pseudocode: import fitz def…
newbie101
  • 65
  • 7
0
votes
0 answers

How to retrieve geometry from a particular PDF layer (OCG)?

Is there another way to retrieve geometry from a particular PDF layer using fitz, other than get_cdrawings()? I've tried to use get_cdrawings, but the value for "layer" is always empty. import fitz from fitz import Document, Page with…
Gennady K
  • 1
  • 2
0
votes
1 answer

Adding data object XML to PDF using PyMuPDF

I am struggling to add a data object to a PDF using PyMuPDF. I have succeeded in adding a PDF as an embedded file, but I cannot add an XML file. I am trying to use the following function: embfile_add. The embedded XML file will be used to get data into…
Camilo
  • 335
  • 5
  • 7