Questions tagged [pymupdf]

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”. MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats. NOTE: It is imported in Python as fitz.

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats.

These are files with extensions .pdf, .xps, .oxps, .cbz, .fb2 or .epub (so you can develop e-book viewers in Python).

PyMuPDF provides access to many important functions of MuPDF from within a Python environment.

Note on the Name fitz:

The standard Python import statement for this library is import fitz. This has a historical reason.
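
As a minimal sketch of the description above (not taken from the PyMuPDF documentation), the listed extensions and the `import fitz` convention could be used as follows. The helper names `is_supported` and `page_count` are hypothetical and exist only for illustration; `page_count` assumes PyMuPDF is installed.

```python
from pathlib import Path

# Extensions listed in the tag description above.
SUPPORTED_EXTENSIONS = {".pdf", ".xps", ".oxps", ".cbz", ".fb2", ".epub"}


def is_supported(path):
    """Return True if the file extension is one the tag description lists.

    Hypothetical helper: it only inspects the file name, not the contents.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def page_count(path):
    """Open a document with PyMuPDF and return its number of pages.

    Hypothetical helper; requires PyMuPDF to be installed.
    """
    import fitz  # PyMuPDF; imported lazily so is_supported() works without it

    with fitz.open(path) as doc:
        return doc.page_count
```

For example, `is_supported("release_notes.PDF")` is true because the check ignores case, while a `.docx` file would be rejected before any attempt to open it.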

257 questions
2
votes
1 answer

Python scraping an unstructured PDF

We get biweekly software releases from a supplier who provides us with PDF release notes. The notes have a lot of irrelevant stuff in them, but ultimately we need to go and manually copy/paste information from these notes into a Confluence…
Isaac
  • 27
  • 4
2
votes
1 answer

Extracting complete hyperlink string from PDF using PyMuPDF

I'm trying to extract every single link from a PDF. I'm able to get every single hyperlink using this code: folder = "test_folder" folder_data = [os.path.join(dp, f) for dp, dn, filenames in os.walk(folder) for f in filenames if…
jorge gill
  • 21
  • 2
2
votes
1 answer

How to auto resize QVBoxLayout according to its child contents inside a QScrollArea?

Recently, I have been trying to use PyQt5 to make a PDF viewer. I adapted the code provided in this post (Image Viewer GUI fails to properly map coordinates for mouse press event). I created a QScrollArea that contains a QVBoxLayout in order to dynamically…
ps2pspgood
  • 61
  • 7
2
votes
2 answers

Extract text in a rectangle from pdf - Python

I have a requirement to extract text that is inside a rectangle in a PDF. I have tested several methods, but I am not getting the specific text. For example, I tested the PyMuPDF, pdfplumber, tabula, camelot, and pdftables packages. In PyMuPDF module…
Kamaal Shaik
  • 57
  • 1
  • 9
2
votes
2 answers

Extract images of pdf with pymupdf in right order

I am currently working on a Python 3.x image extractor for PDF files and can't seem to find a solution for the problem I have been facing throughout my work. My intention is to extract all the images of PDF files (vehicle reports) without the logos…
Jani
  • 107
  • 1
  • 3
  • 9
2
votes
0 answers

How can I determine whether a PDF page contains redacted material?

I have a set of PDFs, for which some pages have had partial contents redacted through Adobe Acrobat. I would like to programmatically iterate through each page and determine whether the page contains redacted content, preferably using Python (note…
crkm
  • 39
  • 3
2
votes
2 answers

How do I access the text from a specific pdf page rather than the entire document

I am trying to extract some stuff from some pdf documents. I have been mucking around with various tools though I have invested the most in pdfminer and pymupdf. I started with pdfminer but started testing pymupdf after not being able to address…
PyNEwbie
  • 4,882
  • 4
  • 38
  • 86
1
vote
0 answers

How to handle ligature issue while using pdf text

I need to capture some text from some PDFs, and I use PyMuPDF to do this. But I am facing a ligature issue while writing the selected text to a text file. I use the following code snippet to read the PDF: pdf = fitz.open("file_path") full_text = "" for…
1
vote
1 answer

How to match placement,font,style and size of replaced text with search text in PDF files using Python?

I'm using Python and the PyMuPDF library to search for and replace text in PDF files. It's working properly, but the replaced text does not keep the color and style of the original. How can I fix this? Here's the code I'm currently using: import os import fitz # Prompt user for…
Hetul
  • 11
  • 1
1
vote
1 answer

PyMuPdf extract pdf information into a csv file, from multiple files. Why is this code only extracting data from the first page of each PDF?

I am trying to extract specific information from every PDF file in a folder into a single CSV file. Each PDF has the information across multiple pages. However, something is wrong with my loop or how it is implemented, and I am not quite sure what. The…
J D
  • 11
  • 2
1
vote
0 answers

How to highlight a blob of text using PyMupdf

I have a PDF file that I am reading with the PyMuPDF package. I read the text and break it into chunks. For the text in the screenshot below, from one of the pages of the original PDF, I get the text read as below: The text I have in…
Baktaawar
  • 7,086
  • 24
  • 81
  • 149
1
vote
0 answers

How can I improve the PDF compression quality in my Python code using the PyMuPDF library?

Main goal: My main goal for this side project is to make a script that can read all the files in a Google Drive, identify all the PDFs, and compress each PDF file to take less space. Below is how far I have got. I have a Python script that uses the…
1
vote
1 answer

Recognizing drop caps in PDF in python

I'm currently using pymupdf to extract text blocks from a file in python.

    import fitz
    doc = fitz.open(filename)
    for page in doc:
        text = page.get_text("blocks")
        for item in text:
            print(item[4])

The problem is that drop caps are…
Esraa Abdelmaksoud
  • 1,307
  • 12
  • 25
1
vote
1 answer

How can I edit/modify/replace text in an existing PDF file?

I am working on my final-year project: a website where a user can come and read PDFs. I am adding some features, such as converting currency to the currency of the user's country. I am using Flask and PyMuPDF for my project and I don't know how I…
1
vote
0 answers

How can I either ignore blank pages in a PDF using Python or add blank pages to a location without changing the total number of pages until the doc is saved?

So I'm using the tkinter and pymupdf libraries to add blank pages to a desired location. This is done by pressing a button which inserts the blank page below the page on the button. My issue is that once it inserts the blank page, the original page…