Questions tagged [pymupdf]

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”. MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats. NOTE: It is imported in Python as fitz.

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats.

These are files with extensions .pdf, .xps, .oxps, .cbz, .fb2 or .epub (so you can develop e-book viewers in Python).

PyMuPDF provides access to many important functions of MuPDF from within a Python environment.

Note on the Name fitz:

The standard Python import statement for this library is import fitz. This has a historical reason.

257 questions
1
vote
1 answer

Developing a generalized logic of getting highlighted area from multiple pdfs into pandas dataframe using python

I have created a solution in Python that extracts highlighted portions from a PDF using PyMuPDF (imported as fitz). This is the code: def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int,…
technophile_3
  • 531
  • 6
  • 21
1
vote
0 answers

Programmatically change printer setting for each page in pdf file

I am using Python 3.10 and win32api to send a print job to a printer. I can change some settings (such as the tray) before printing, and that works fine. The problem is that I cannot update the settings for each page. I browse the PDF using pymupdf but it seems there is…
khelili miliana
  • 3,730
  • 2
  • 15
  • 28
1
vote
0 answers

How to remove text layer from pdf using python

I need to remove all text information from a PDF file. The resulting file should look like a scan: only images wrapped as a PDF, with no text that can be copied or selected. Currently I am using a ghostscript command: import os ... os.system(f"gs -o {output_path}…
Demetry Pascal
  • 383
  • 4
  • 15
1
vote
1 answer

Overlay 2 pdf files by each page using pymupdf

I need to combine (merge/overlay) 2 PDF files, placing the second on top of the first, page by page. I've tried the code import fitz doc1 = fitz.open(background) doc2 = fitz.open(only_text_path) doc1.insertPDF(doc2) but it only concatenates doc1 + doc2, doesn't…
Demetry Pascal
  • 383
  • 4
  • 15
1
vote
1 answer

How to highlight multiple keywords in a .pdf file using PyMuPDF library

I am able to highlight all the occurrences of a single word in a .pdf file using this, but I am unable to highlight multiple keywords. Here's my code: import fitz import os keywords = ["remote","setup"] pdfFile = "\D:\Python_Scripts\Email…
Amir Khan
  • 17
  • 1
  • 7
1
vote
0 answers

Extract GPA from a resume in Python using PyMuPDF

We made a program for simple resumes that extracts the whole resume content into a string, line by line. Now I want to extract the GPA from that string. I have tried a lot but could not find a way to do this. So if anyone could figure this out, it will be very…
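One possible approach to the GPA question above, as a minimal sketch: once the resume text is available as a string, a regular expression can pick out a number that follows the label "GPA". The pattern, the function name extract_gpa, and the sample text are assumptions made for illustration and are not part of the original question. The pattern only handles single-digit scales such as 4.0.

```python
import re
from typing import Optional

# Assumed pattern: the label "GPA" or "CGPA", an optional ":" or "-",
# then a number such as 3.7 or 3.72, optionally followed by "/4.0".
GPA_PATTERN = re.compile(
    r"\bC?GPA\b\s*[:\-]?\s*(\d(?:\.\d{1,2})?)(?:\s*/\s*\d(?:\.\d{1,2})?)?",
    re.IGNORECASE,
)


def extract_gpa(resume_text: str) -> Optional[float]:
    """Return the first GPA-like number found after a GPA label, or None."""
    match = GPA_PATTERN.search(resume_text)
    return float(match.group(1)) if match else None


# Hypothetical resume text, standing in for the string extracted from the PDF
sample = "Education\nB.Sc. Computer Science, GPA: 3.72/4.0\nSkills: Python"
print(extract_gpa(sample))
```

Resumes vary widely in how they present grades, so a real solution would likely need several patterns and a sanity check on the value range.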
1
vote
1 answer

python - read pdf ignoring header and footer

I have a pdf file that I am reading using pymupdf using the below syntax. import fitz # this is pymupdf with fitz.open('file.pdf') as doc: text = "" for page in doc: text += page.getText() Is there a way to ignore the header and…
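One way to approach the header/footer question above, sketched under stated assumptions: filter text blocks by their vertical position and keep only those between a top and a bottom margin. The code assumes block tuples in the shape that page.get_text("blocks") returns in recent PyMuPDF versions (older versions spell the method getText), namely (x0, y0, x1, y1, text, block_no, block_type). The function name body_text and the 50-point margins are hypothetical and would need tuning per document.

```python
from typing import List, Tuple

# Shape assumed from PyMuPDF's "blocks" extraction:
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def body_text(
    blocks: List[Block],
    page_height: float,
    header_margin: float = 50.0,
    footer_margin: float = 50.0,
) -> str:
    """Join the text of blocks lying fully between the header and footer bands."""
    kept = []
    for _x0, y0, _x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        if y0 < header_margin:  # block starts inside the header band
            continue
        if y1 > page_height - footer_margin:  # block ends inside the footer band
            continue
        kept.append(text)
    return "".join(kept)


# Synthetic blocks for a page that is 792 points tall
sample_blocks = [
    (72, 20, 300, 35, "Company Report\n", 0, 0),
    (72, 100, 500, 130, "First paragraph.\n", 1, 0),
    (72, 140, 500, 170, "Second paragraph.\n", 2, 0),
    (72, 760, 300, 775, "Page 1 of 9\n", 3, 0),
]
print(body_text(sample_blocks, page_height=792))
```

With a real document, the blocks would come from page.get_text("blocks") and the height from page.rect.height. Fixed margins are a heuristic; documents with variable header sizes may need a different rule.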
1
vote
1 answer

how to extract text from a selection of pages in a larger pdf using pymupdf?

I know there are many libraries to extract text from PDF. Specifically, I've been having some difficulty with pymupdf. From the documentation here: https://pymupdf.readthedocs.io/en/latest/app4.html#sequencetypes I was hoping to use select() to pick…
Katie Melosto
  • 1,047
  • 2
  • 14
  • 35
1
vote
1 answer

How to find table grid lines in PDF files?

To more accurately extract table-like data embedded within table cells, I would like to be able to identify table cell boundaries in PDFs like this: I have tried extracting such tables using Camelot, pdfplumber, and PyMuPDF, with varying degrees of…
1
vote
2 answers

Is there any way that I can identify whether the PDF is edited/tampered and the exact location where the PDF is edited/tampered using Python?

I am working on identifying forgery/tampering in bank statement PDF documents. Info metadata and XMP metadata are not always present in the PDFs that I have, so I am not able to create any generalized rule to identify tampered PDFs. I am using Python…
1
vote
2 answers

selecting the exact match using pymupdf-page.searchFor()

Below is a piece of my code, where I'm searching for a particular word and extracting its coordinates. As per the documentation, the signature is page.searchFor(needle, hit_max=16, quads=False, flags=None). Searches for needle on a page. Upper/lower…
RevolverRakk
  • 309
  • 4
  • 10
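For the exact-match question above, one workaround (a sketch, not the library's own search) is to filter the page's word list instead of relying on substring search. The code assumes word tuples in the shape that page.get_text("words") returns in recent PyMuPDF versions: (x0, y0, x1, y1, word, block_no, line_no, word_no). The function name exact_matches and the sample data are hypothetical.

```python
import string
from typing import List, Tuple

# Shape assumed from PyMuPDF's "words" extraction:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Box = Tuple[float, float, float, float]


def exact_matches(words: List[Word], needle: str, case_sensitive: bool = False) -> List[Box]:
    """Return bounding boxes of words equal to needle, ignoring surrounding punctuation."""

    def normalize(token: str) -> str:
        token = token.strip(string.punctuation)
        return token if case_sensitive else token.lower()

    target = normalize(needle)
    return [
        (x0, y0, x1, y1)
        for x0, y0, x1, y1, word, *_ in words
        if normalize(word) == target
    ]


# Synthetic word list: only "art" and "Art," should match, not "artist"
sample_words = [
    (10, 10, 30, 20, "art", 0, 0, 0),
    (35, 10, 70, 20, "artist", 0, 0, 1),
    (75, 10, 95, 20, "Art,", 0, 0, 2),
]
print(exact_matches(sample_words, "art"))
```

This only handles single-word needles. Multi-word phrases would need a comparison over consecutive words on the same line.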
1
vote
1 answer

Using python PyMuPDF (fitz) to iterate through lines and check length of it and add a period if it meets the criteria

I am trying to iterate through each line of a page with the PyMuPDF library to check the length of the sentence. If it is less than 10 words, I would like to add a full stop. Pseudo code would be: #loop through the lines of the PDF #check number of…
user11464178
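The pseudo code in the question above could be sketched on plain text as follows. This is a minimal illustration that assumes the page text has already been extracted into a string. The function name add_periods and the rule of skipping lines that already end in ".", "!" or "?" are assumptions, not part of the original question.

```python
def add_periods(text: str, max_words: int = 10) -> str:
    """Append a full stop to every non-empty line with fewer than max_words words."""
    result = []
    for line in text.splitlines():
        stripped = line.rstrip()
        words = stripped.split()
        # Only touch short, non-empty lines that lack sentence-ending punctuation
        if words and len(words) < max_words and not stripped.endswith((".", "!", "?")):
            stripped += "."
        result.append(stripped)
    return "\n".join(result)


# Hypothetical page text, standing in for the string extracted from the PDF
sample_text = (
    "Short heading\n"
    "This line already ends with a period.\n"
    "\n"
    "one two three four five six seven eight nine ten eleven"
)
print(add_periods(sample_text))
```

This changes only the extracted string. Writing the modified text back into the PDF is a separate problem that the sketch does not address.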
1
vote
0 answers

How to save different versions of a single pdf, with different highlights, PyMuPDF, Python?

I have a PDF document and, for simplicity, I want to make two (or more) differently edited versions of it. For example, in one version I want every "and" to be highlighted, and in the second I want every "the" to be highlighted. I…
yoyo yoyo
  • 21
  • 1
  • 3
1
vote
1 answer

How to Highlight a specific line/text in a pdf using Python

I am new to python and have been working on a project to make a new pdf with highlighted text. I am using pymupdf to get the text and am storing the text, font size, and the index of the text. I found a way to highlight the text but it searches and…
yoyo yoyo
  • 21
  • 1
  • 3
1
vote
2 answers

Python PyMuPDF looping next pages

I'm using the code below to open a PDF file and convert it into an image file as output. Now I'm trying to figure out how I can loop to the next page and convert it to the same output file. Any help is much appreciated! # display image on the canvas def…
faizal_a
  • 93
  • 4
  • 15