How can I improve the PDF compression quality in my Python code using the PyMuPDF library?

Question

Main goal: For this side project, I want a script that reads all the files in a Google Drive, identifies the PDFs, and compresses each one so it takes less space. Below is how far I have got.

I have a Python script that uses the PyMuPDF library to compress PDF files. The script reads all PDF files from a folder, compresses them using the default settings, and saves the compressed files to an output folder.

Here's the relevant code:


from google.colab import drive
drive.mount('/content/drive')



input_folder = '/content/drive/MyDrive/pdf_folder/'
output_folder = '/content/drive/MyDrive/Target Folder/'


!pip install 'PyPDF2<3.0'
!pip install PyMuPDF



import os
import glob
import fitz

def get_size_format(b, factor=1024, suffix="B"):
   
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if b < factor:
            return f"{b:.2f}{unit}{suffix}"
        b /= factor
    return f"{b:.2f}Y{suffix}"

def compress_file(input_file: str, output_file: str):
    """Compress PDF file"""
    if not output_file:
        output_file = input_file
    initial_size = os.path.getsize(input_file)
    try:
        doc = fitz.open(input_file)
        # Save with stream compression (deflate) enabled; other options stay at their defaults
        doc.save(output_file, deflate=True)
        doc.close()
    except Exception as e:
        print("Error compress_file=", e)
        return False
    compressed_size = os.path.getsize(output_file)
    ratio = 1 - (compressed_size / initial_size)
    summary = {
        "Input File": input_file, "Initial Size": get_size_format(initial_size),
        "Output File": output_file, "Compressed Size": get_size_format(compressed_size),
        "Compression Ratio": "{0:.3%}.".format(ratio)
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################\n\n")
    return True

if __name__ == "__main__":
    # input_folder = sys.argv[1]
    # output_folder = sys.argv[2]
  
    # Create output folder if it does not exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    # Find all PDF files in input folder
    pdf_files = glob.glob(os.path.join(input_folder, "*.pdf"))
    # Compress each PDF file and save to output folder
    for pdf_file in pdf_files:
        output_file = os.path.join(output_folder, os.path.basename(pdf_file))
        compress_file(pdf_file, output_file)

The code works fine, but I would like to improve the compression quality. How can I achieve that? Are there any additional options or parameters I can pass to doc.save() to improve the compression quality? Or should I use a different library or approach altogether?

**NOTE: You cannot use the PyPDF2 module, because the method it uses for compression is deprecated. I tried a lower version (2.12) and it still does not work; you can try fixing it. I have given the whole code of the project, so feel free to copy it and use it in your own workflow.** I am using Google Colab for this project. Sample output:

## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/3-1 Regular .pdf
Initial Size:198.16KB
Output File:/content/drive/MyDrive/Target Folder/3-1 Regular .pdf
Compressed Size:197.20KB
Compression Ratio:0.487%.
###################################################################


## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/CN  ASSIGNMENT 2.pdf
Initial Size:2.37MB
Output File:/content/drive/MyDrive/Target Folder/CN  ASSIGNMENT 2.pdf
Compressed Size:2.37MB
Compression Ratio:-0.000%.
###################################################################


## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/PCS Answers.pdf
Initial Size:6.10MB
Output File:/content/drive/MyDrive/Target Folder/PCS Answers.pdf
Compressed Size:6.10MB
Compression Ratio:0.019%.
###################################################################


## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/NLP All units.pdf
Initial Size:6.55MB
Output File:/content/drive/MyDrive/Target Folder/NLP All units.pdf
Compressed Size:6.81MB
Compression Ratio:-3.929%.
###################################################################

As you can see, that is not enough compression. Please try to improve it. Any help or guidance would be greatly appreciated. Thank you!
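The last summary shows a negative ratio, meaning the saved file came out larger than the input. One possible guard is sketched below, on the assumption that keeping a copy of the original is acceptable in that case. The helper name `keep_smaller` is hypothetical and not part of PyMuPDF; it uses only the standard library.

```python
import os
import shutil


def keep_smaller(original: str, compressed: str) -> str:
    """Make sure the file at `compressed` is not larger than `original`.

    If the rewritten file did not shrink, it is overwritten with a copy of
    the original. Returns "original" or "compressed" to say which content
    ended up at the `compressed` path.
    """
    # Nothing to compare when both paths point at the same file.
    if os.path.samefile(original, compressed):
        return "original"
    if os.path.getsize(compressed) >= os.path.getsize(original):
        shutil.copyfile(original, compressed)
        return "original"
    return "compressed"
```

In the script above, this could be called in `compress_file` right after `doc.close()`, for example `keep_smaller(input_file, output_file)`, so the reported ratio never drops below zero.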

You can also try removing potential dead objects in the file by using garbage collection: `doc.save(filename, deflate=True, garbage=4)`. The highest option, 4, consolidates the XREF table and deletes unused and duplicate objects and object streams. That's about all that's possible. - Jorj McKie, Apr 13 '23 at 10:28
If your PDFs have embedded fonts (or font subsets), you could use the default fonts that PDF provides instead. - Joop Eggen, Apr 15 '23 at 08:31
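A minimal sketch of the garbage-collection suggestion from the first comment, using only the `garbage` and `deflate` options it names. The function names `save_options` and `compress_pdf` are hypothetical, and `fitz` is imported inside the function as an assumption so the options helper can be used without PyMuPDF installed. These options are lossless, so files that are mostly images may still shrink very little.

```python
def save_options() -> dict:
    """Keyword arguments for Document.save(), as suggested in the comment."""
    return {
        "garbage": 4,     # highest level: drop unused and duplicate objects
        "deflate": True,  # compress streams that are stored uncompressed
    }


def compress_pdf(input_file: str, output_file: str) -> None:
    """Rewrite input_file to output_file with the options above."""
    import fitz  # PyMuPDF; imported here so save_options() works without it

    doc = fitz.open(input_file)
    try:
        doc.save(output_file, **save_options())
    finally:
        doc.close()
```

In the question's script, the same effect could be had by changing the existing call to `doc.save(output_file, deflate=True, garbage=4)`.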
