Questions tagged [pymupdf]

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”. MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats. NOTE: It is imported in Python as fitz.

PyMuPDF is a Python binding for MuPDF – “a lightweight PDF and XPS viewer”.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ (comic book archive), FB2 and EPUB (e-book) formats.

These are files with extensions .pdf, .xps, .oxps, .cbz, .fb2 or .epub (so you can develop e-book viewers in Python).
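
As a rough illustration of the extension list above, here is a minimal sketch of how a script might filter filenames before handing them to PyMuPDF. The names SUPPORTED_EXTENSIONS and has_supported_extension are hypothetical helpers, not part of PyMuPDF, and the check only looks at the file name, not at the file content.

```python
from pathlib import Path

# Extensions taken from the tag description above.
SUPPORTED_EXTENSIONS = {".pdf", ".xps", ".oxps", ".cbz", ".fb2", ".epub"}


def has_supported_extension(filename):
    """Return True if the file name ends in one of the listed extensions.

    Hypothetical helper for illustration; the comparison is case-insensitive.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for name in ("report.pdf", "book.EPUB", "notes.txt"):
        print(name, has_supported_extension(name))
```

A name-based filter like this is only a first pass; a file with a matching extension can still fail to open if its content is damaged or mislabeled.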

PyMuPDF provides access to many important functions of MuPDF from within a Python environment.

Note on the Name fitz:

The standard Python import statement for this library is import fitz. This has a historical reason.
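
A minimal sketch of the import in practice, assuming PyMuPDF is installed. The helpers load_fitz and blank_pdf_page_count are hypothetical names used only for illustration. Calling fitz.open() with no arguments creates a new, empty PDF in memory, and new_page() is the method name in recent PyMuPDF versions (older releases spelled it newPage).

```python
def load_fitz():
    """Return the PyMuPDF module, or None if it cannot be imported."""
    try:
        import fitz  # PyMuPDF is imported under the name "fitz"
    except ImportError:
        return None
    # Guard against a module named "fitz" that is not PyMuPDF.
    return fitz if hasattr(fitz, "open") else None


def blank_pdf_page_count(n_pages):
    """Create an in-memory PDF with n_pages blank pages and return its length.

    Returns None when PyMuPDF is not available.
    """
    fitz = load_fitz()
    if fitz is None:
        return None
    doc = fitz.open()          # no arguments: new, empty PDF
    for _ in range(n_pages):
        doc.new_page()         # append a blank page with default size
    count = len(doc)           # number of pages in the document
    doc.close()
    return count


if __name__ == "__main__":
    print(blank_pdf_page_count(2))
```

Returning None instead of raising keeps the sketch usable on machines where the package is missing; a real script would more likely let the ImportError surface.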

257 questions

4 votes, 1 answer

How to get background color of a Text in PyMuPDF

I am trying to identify possible table headers in a table inside a PDF using the background and foreground color of the text. With PyMuPDF text extraction, I was able to get the foreground color. I am wondering if there is a way to get background…
Suvin K S

3 votes, 1 answer

Ways to separate passages in a PDF using the gap?

I have some PDFs with 2-3 passages on every page. Every passage is separated by a line gap, but when reading with PyMuPDF, I cannot see any machine printable separator between passages. Is there any other way, or another library that can do…

3 votes, 0 answers

How to use fitz (PyMuPDF) with py2app or pyinstaller [ModuleNotFoundError]?

I want to convert my Python script, which contains a PDF-to-image converter, to a .app file on macOS and be able to run it on a different machine. I have tried both pyinstaller and py2app and get the following error message: Traceback (most recent…
Bryan_Koh

3 votes, 1 answer

Why can't I correctly extract the image from this PDF? [Please need help]

I am currently working on OCR on PDF files. Here is my pipeline: I first extract the image from the PDF (since my PDFs contain scanned documents) and convert it to a NumPy array, then I read it with Tesseract. It works pretty well on most of my images, but I have…
curious

3 votes, 0 answers

Capture screenshot from pdf page

I have a PDF document, and this page has an image of a graph plot; however, the legend of the plot is not part of the image. I am using PyMuPDF to extract this image as follows: for img in doc.getPageImageList(page_num, full=True): xref =…
CuriousBug

3 votes, 4 answers

Convert PDF file to multipage image

I'm trying to convert a multipage PDF file to an image with PyMuPDF:

    pdffile = "input.pdf"
    doc = fitz.open(pdffile)
    page = doc.loadPage()  # number of page
    pix = page.getPixmap()
    output = "output.tif"
    pix.writePNG(output)

But I need to convert all the…
David Delos

3 votes, 0 answers

Decoding problem with fitz.Document in Python 3.7

I want to extract the text of a PDF and use some regular expressions to filter for information. I am coding in Python 3.7.4, using fitz to parse the PDF. The PDF is written in German. My code looks as follows: doc = fitz.open(pdfpath) pagecount =…
Riprip

3 votes, 3 answers

Adding text to a PDF using PyMuPDF

I'm trying to add text to a PDF by opening the PDF, adding a text box, and saving it. When I run the code, nothing happens. On the desktop, it shows the file has been updated, but there is no text displayed on it. Here's the code: import fitz doc =…
Khayla Black

3 votes, 2 answers

Can't read the content of a certain page of a pdf file available online

I've used the PyMuPDF library to parse the content of a specific page of a local PDF file and found it working. However, when I try to apply the same logic to a specific page of a PDF file available online, I encounter an…
MITHU

2 votes, 2 answers

I am having an import error with the fitz library in PyCharm

I am having this issue of importing the fitz library in PyCharm. I pip installed PyMuPDF and in my code I added "import fitz" but it is giving me this error: ImportError:…
jjboi8708

2 votes, 1 answer

Keywords being highlighted in wrong color using PyMuPDF

I'm doing some basic keyword highlighting, but I'm running into a strange issue. When I enter a stroke color with floating point RGB values (as shown below), the highlights come out in multiple different colors. In this case, I want the highlights…
almosthavoc

2 votes, 1 answer

RTL (Arabic) ligatures problem when extracting text from PDF

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in backward order, which is normal behavior for RTL languages, and you need to use a bidi algorithm to be able to display it correctly…
Naourass Derouichi

2 votes, 1 answer

How to add a border to hyperlink with Fitz module?

I spent three hours experimenting with this this morning, but I can't manage to make the border visible on a hyperlink within a PDF annotated with the Python FITZ module. Any ideas? Thanks so much in advance! import fitz doc =…

2 votes, 0 answers

Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) - text positioned in the middle for each row

I extracted data from PDF file. I am sharing a sample of the page here. I extracted data from the PDF using Tabula-py. These are the arguments I used to extract the text from PDF page. import numpy as np import pandas as pd from tabula.io import…
Joe

2 votes, 0 answers

How to spread text on multiple pages depending on text size?

What I tried:

    doc = fitz.open()
    page = doc.new_page()
    text = 'Long text'
    tw = fitz.TextWriter(page.rect)
    tw.append((20,40), text, small_caps=True)
    tw.write_text(page)
    doc.ez_save('test.pdf')

How to spread text on multiple pages depending on text…
Zurechtweiser