Using PyMuPDF, I want to extract all images from a PDF and save them as separate files, then replace each image in the PDF with its file name at the same position and save the result as a new document. I can save all images with the following code:

import fitz  # PyMuPDF

# Open the source PDF as a Document object
doc = fitz.open("Article_Example_1_2.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]  # xref number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # GRAY or RGB: save directly
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

# The pages are still unmodified here, so this only writes a copy
doc.save(filename=r"new.pdf")

doc.close()

However, I am not sure how to replace each image in the PDF with the name of its saved file. I would greatly appreciate any help.

Mohammad Ahmed

1 Answer

Message from the repo maintainer:

I am not sure whether we have discussed this in the issues section of the repo. What you can do is use the new "redaction annotation" feature. The basic approach:

  1. Calculate the bbox of each image via Page.getImageBbox().
  2. Add a redaction annotation via Page.addRedactAnnot(bbox, text=filename, ...).
  3. When finished with the page, execute Page.apply_redactions(). This removes the redacted images together with the redaction annotations, and the chosen filename appears in the former image bbox.
  4. Save as a new document.

Make sure to use PyMuPDF v1.17.0 or later.
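
The steps above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe for every PDF: it assumes a recent PyMuPDF release, where the methods named above are spelled in snake_case (get_image_bbox, add_redact_annot, and get_images for the page's image list). The helper names image_filename and replace_images_with_names are hypothetical, and the file name pattern is simply the one used in the question.

```python
def image_filename(page_index, xref):
    """Build the PNG name for an extracted image (pattern taken from the question)."""
    return "p%s-%s.png" % (page_index, xref)


def replace_images_with_names(src_path, dst_path):
    """Save each image as a PNG, then redact it and put its file name in its place.

    Assumption: a recent PyMuPDF with snake_case method names is installed.
    """
    import fitz  # PyMuPDF; imported lazily so image_filename() works without it

    doc = fitz.open(src_path)
    for page in doc:
        for img in page.get_images(full=True):
            xref = img[0]
            name = image_filename(page.number, xref)

            # Extract and save the image, converting CMYK to RGB first
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(name)

            # Step 1: locate the image on the page
            bbox = page.get_image_bbox(img)
            if bbox.is_empty or bbox.is_infinite:
                continue  # no usable rectangle was found for this image

            # Step 2: mark the area for redaction, with the file name as replacement text
            page.add_redact_annot(bbox, text=name)

        # Step 3: apply all redactions once the page is fully processed
        page.apply_redactions()

    # Step 4: save as a new document
    doc.save(dst_path)
    doc.close()
```

A call such as replace_images_with_names("Article_Example_1_2.pdf", "new.pdf") would then write the PNG files to the current directory and the redacted copy to new.pdf. On older releases such as v1.17.0, the camelCase names given in the steps would be used instead.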

Jorj McKie