Delete text from PDF using PyMuPDF

Question

I need to remove the text "DRAFT" from a PDF document using Python. I can find the text box containing the text, but I can't find an example of how to edit the PDF text element using PyMuPDF.

In the example below, the draft object contains the coordinates of the DRAFT text element.

import fitz

fname = r"original.pdf"
doc = fitz.open(fname)
page = doc.load_page(0)

draft = page.search_for("DRAFT")

# insert code here to delete the DRAFT text or replace it with an empty string

out_fname = r"final.pdf"
doc.save(out_fname)

Added 4/28/2022: I found a way to delete the text, but unfortunately it also deletes any overlapping text underneath the box around DRAFT. I really just want to delete the DRAFT letters without modifying the underlying layers.

# insert code here to delete the DRAFT text or replace it with an empty string
rl = page.search_for("DRAFT", quads = True)
page.add_redact_annot(rl[0])

page.apply_redactions()

In this case, a map exported from ArcGIS Pro, the Draft is just a horizontal text element overlaid over other text. I'm not sure what anylyser is — user3005422, Apr 28 '22 at 19:05

xiaoxu · Answer 1 · 2022-09-26T08:26:25.003

You can try this.

import fitz

doc = fitz.open("xxxx")

for page in doc:
    for xref in page.get_contents():
        stream = doc.xref_stream(xref).replace(b'The string to delete', b'')
        doc.update_stream(xref, stream)
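
The snippet above reads each page's raw content streams, deletes the target bytes, and writes the stream back. A slightly fuller sketch of the same idea is below. It also counts the matches, so you can tell whether the string was found at all, and it saves the result. The names strip_literal and remove_text are hypothetical helpers, not part of PyMuPDF. The approach assumes the text is stored as one contiguous literal string in the stream, so it will find nothing if the word is split across several operators or is hex-encoded.

```python
from typing import Tuple


def strip_literal(stream: bytes, target: bytes) -> Tuple[bytes, int]:
    """Remove every occurrence of `target` from a raw content stream.

    Returns the edited stream and the number of occurrences removed.
    """
    count = stream.count(target)
    return stream.replace(target, b""), count


def remove_text(src: str, dst: str, target: bytes) -> int:
    """Apply strip_literal to every content stream of every page (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so strip_literal works without it

    doc = fitz.open(src)
    removed = 0
    for page in doc:
        for xref in page.get_contents():
            stream, n = strip_literal(doc.xref_stream(xref), target)
            if n:  # only rewrite streams that actually changed
                doc.update_stream(xref, stream)
                removed += n
    doc.save(dst)
    return removed
```

Calling remove_text("original.pdf", "final.pdf", b"DRAFT") with the file names from the question would return the number of occurrences removed. A return value of 0 suggests the text is not stored as a plain literal.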

edited Sep 26 '22 at 08:26

answered Sep 26 '22 at 08:25

It will be better if you can explain in a few words what your code is doing. – Harish Talanki Sep 29 '22 at 22:11

For anyone else who gets here. This didn't work for my use-case. I have a diagonal "draft" text that is overlaid over the document that I need to remove. The above solution works to delete horizontal text. – haredev Oct 28 '22 at 03:01

score 0 · Answer 2 · answered May 09 '23 at 14:52

This is an example of how to manipulate PDF page strings by modifying the text-drawing commands (the Tj operator). The example below simply removes every string-drawing command from the page. In some cases a replacement can be done with a simple bytes.replace(), but in others it is a non-trivial task, because each character may be a separate command and the commands may not even appear in human-readable order.

import fitz  # PyMuPDF

# more about text operators:
# https://www.syncfusion.com/succinctly-free-ebooks/pdf/text-operators
def remove_tj(page: fitz.Page):
    doc: fitz.Document = page.parent
    
    xref_page = page.xref
    if xref_page == 0:
      raise RuntimeError("page xref is zero")
    
    props = doc.xref_get_keys(xref_page)
    if 'Contents' not in props:
      raise RuntimeError("no 'Contents' key in page dict")
    
    content = doc.xref_get_key(xref_page, 'Contents')
    
    if content[0] == 'xref':
      if content[1].endswith(' 0 R'):
        contents_xref = int(content[1][:-4]) # 'Contents' is a reference to another xref
      else:
        raise RuntimeError('PDF struct issue #2')
    else:
      raise RuntimeError('PDF struct issue #1')
    
    if not doc.xref_is_stream(contents_xref):
      raise RuntimeError('PDF struct issue #3')
    
    # page content command stream (here the commands are separated by ASCII '\r';
    # other PDFs may use '\n' or spaces instead):
    cmds: 'list[bytes]' = doc.xref_stream(contents_xref).split(b'\r')
    
    i = 0
    while i < len(cmds):
      if cmds[i].endswith(b' Tj'): # draw string operator
        # the string is usually in parentheses and may contain \x hex-encoded values: (...) Tj
        print(cmds[i][1:-4])
        # here you can manipulate the text bytes;
        # a word may be split across several Tj operators
        cmds.pop(i) # for example, this removes every text operator from the page
      else:
        i += 1
    
    doc.update_stream(contents_xref, b'\r'.join(cmds), new=0, compress=1)
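
Splitting on b'\r' only works when the producer happened to separate commands that way. As a hedged alternative, the sketch below uses a regular expression to find (...) Tj operators regardless of the whitespace around them, and blanks only those whose string contains a given needle. The name blank_tj_containing is hypothetical. The pattern assumes literal strings without nested unescaped parentheses, and it deliberately ignores TJ arrays and hex strings, so text stored in those forms is left untouched.

```python
import re

# Matches a literal-string show operator "( ... ) Tj".
# Escaped characters such as \) are allowed inside the string;
# nested unescaped parentheses are NOT handled (assumption of this sketch).
TJ_LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)\s*Tj")


def blank_tj_containing(stream: bytes, needle: bytes) -> bytes:
    """Replace each `(...) Tj` whose text contains `needle` with an empty `() Tj`."""

    def repl(match):
        # Keep operators that do not mention the needle exactly as they were.
        return b"() Tj" if needle in match.group(0) else match.group(0)

    return TJ_LITERAL.sub(repl, stream)
```

The result can be written back the same way as in the function above: read the bytes with doc.xref_stream(contents_xref), pass them through blank_tj_containing, and hand the result to doc.update_stream. Emitting an empty string instead of deleting the operator leaves the rest of the command sequence unchanged.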
