
I need to remove all text information from a PDF file. The resulting file should look like a scan: only images wrapped as a PDF, with no text that you can copy or select. I am currently using this Ghostscript command:

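import subprocess
...
# -dFILTERTEXT tells the pdfwrite device to drop text from the output.
# Passing an argument list avoids shell quoting problems with paths containing spaces.
subprocess.run(["gs", "-o", output_path, "-sDEVICE=pdfwrite", "-dFILTERTEXT", input_path], check=True)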

Unfortunately, with some documents it removes not only the text layer but also the actual pixels of the characters! Sometimes no rendered text is visible on the page at all, which is not what I need.

Is there a stable and fast solution using Python or pip-installable tools? It would be wonderful if I could solve this with PyMuPDF (fitz), but I couldn't find anything about it.

Demetry Pascal
  • 2
    Ghostscript (reasonably recent versions) includes three devices, pdfimage8, pdfimage24 and pdfimage32, which produce a rendered bitmap of the input, wrapped up as a PDF file. Yes, -dFILTERTEXT removes text; that's exactly what it's supposed to do! You can also use -dNoOutputFonts, which turns fonts into vectors, so that there is no text in the output. – KenS Nov 09 '21 at 08:07
  • 2
    *"with some documents it removes not only text layer but real pixels of characters!!!"* - there are different scan/OCR strategies. The simple ones simply add invisible text under or over the image. More elaborate ones cut the letters from the scanned image, create ad-hoc fonts from them, and add *visible text* with these ad-hoc fonts in front of the *bitmap image with the cut-out letters*. The advantage of such scans is that the PDF becomes editable to a degree. The disadvantage is that processes that require the whole bitmap with background and letters have to re-render it, likely lossy. – mkl Nov 09 '21 at 08:35
  • Thank you, guys, I understand the problem now. – Demetry Pascal Nov 09 '21 at 11:32
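
KenS's suggestion of the pdfimage devices could be driven from Python roughly as follows. This is only a sketch under assumptions: it presumes a reasonably recent `gs` binary is on the PATH, the helper names `build_rasterize_command` and `rasterize_pdf` are hypothetical, and the 300 dpi default is an arbitrary choice, not something stated in the comments. Every page is re-rendered as a bitmap, so the output contains no selectable text, but the rendering is lossy and the file may grow.

```python
import subprocess


def build_rasterize_command(input_path, output_path, dpi=300, device="pdfimage24"):
    """Build the Ghostscript argument list (hypothetical helper).

    pdfimage24 renders each page to a 24-bit RGB bitmap wrapped in a PDF;
    pdfimage8 (grayscale) and pdfimage32 (CMYK) are the sibling devices.
    """
    return [
        "gs",
        "-o", output_path,     # -o sets the output file and implies batch mode
        f"-sDEVICE={device}",  # bitmap-in-PDF output device
        f"-r{dpi}",            # rendering resolution in dpi (assumed default: 300)
        input_path,
    ]


def rasterize_pdf(input_path, output_path, dpi=300):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(build_rasterize_command(input_path, output_path, dpi), check=True)
```

Calling `rasterize_pdf("input.pdf", "output.pdf", dpi=200)` would trade some sharpness for a smaller file; the right resolution depends on the documents.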

0 Answers