I am creating a text-free PDF file from a PDF file that contains text, using the following program:

import PyPDF2

def remove_text_from_pdf(pdf_path_in, pdf_path_out):
    '''Removes the text from the PDF file and saves it as a new PDF file'''
    # Open the PDF file with the diagram and the text in binary read mode
    pdf_file = open(pdf_path_in, 'rb')

    # Create a PDF reader and writer object
    pdf_reader = PyPDF2.PdfReader(pdf_file)  # older version was in use
    pdf_writer = PyPDF2.PdfWriter(pdf_file)  # older version 'PdfFileWriter' in use

    # Get the first page from the PDF reader
    page = pdf_reader.pages[0]

    # Add the page from the PDF reader to the PDF writer
    pdf_writer.add_page(page)

    # Remove the text from all pages added to the writer
    pdf_writer.remove_text()

    # Open the output PDF file in binary write mode
    out_file = open(pdf_path_out, 'wb')

    # Save the result to the output file
    pdf_writer.write(out_file)

    return

I am converting the output to a PNG file using the following function:

import fitz  # PyMuPDF

def convert_pdf_to_png(pdf_path, png_path):
    '''Converts a PDF file to a PNG file'''
    # Set the image maximum pixels to be none so that it doesn't give a DOS attack error

    pdffile = pdf_path
    doc = fitz.open(pdffile)
    page = doc.load_page(0)  # page number (0-based)
    pix = page.get_pixmap()
    output = png_path
    pix.save(output)
    doc.close()

But this gives me a PNG file that is just a blank white page.

I was expecting a non-blank PDF file.

1 Answer

PyMuPDF lets you do all of the above in one go! No need to use additional packages.

As an aside: You cannot convert a complete PDF with multiple pages into one PNG file. You must either create a new PDF with text-free pages, or create multiple PNG images, one for each page (with text removed).

Here is the code that removes all text from all pages and then saves the resulting PDF under a new name:

import fitz  # PyMuPDF
doc = fitz.open(pdf_file)  # pdf_file: name of the input PDF
for page in doc:
    page.add_redact_annot(page.rect)  # mark the whole page for redaction
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # leave images untouched
doc.save("notext-" + pdf_file, garbage=4, deflate=True)  # save under new name
# DONE!
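
One detail to watch in the save call above: `"notext-" + pdf_file` only gives a sensible name when `pdf_file` is a bare filename. With a path such as `docs/report.pdf`, the prefix ends up in front of the directory. Here is a small sketch of one way around that with `pathlib`; the helper name `notext_path` and the example path are made up for illustration:

```python
from pathlib import Path


def notext_path(pdf_file, prefix="notext-"):
    """Return pdf_file with prefix added to the filename only (hypothetical helper)."""
    path = Path(pdf_file)
    # with_name() replaces just the last path component and keeps the directory.
    return path.with_name(prefix + path.name)


# Plain string concatenation puts the prefix in front of the directory part.
print("notext-" + "docs/report.pdf")                # notext-docs/report.pdf
# The helper keeps the directory and prefixes only the filename.
print(notext_path("docs/report.pdf").as_posix())    # docs/notext-report.pdf
```

The result can then be passed to the save call as `str(notext_path(pdf_file))` instead of the concatenated string.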

If you instead want page images with no text, do this:

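doc = fitz.open(pdf_file)  # pdf_file: name of the input PDF
for page in doc:
    page.add_redact_annot(page.rect)  # mark the whole page for redaction
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # leave images untouched
    pix = page.get_pixmap()  # render the now text-free page
    pix.save(f"page-{page.number}.png")  # one PNG per page, numbered from 0
# DONE!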
Jorj McKie