
Main goal: for this side project I want to write a script that reads all the files in a Google Drive, identifies the PDFs, and compresses each PDF file so that it takes up less space. Below is how far I have got.

I have a Python script that uses the PyMuPDF library to compress PDF files. The script reads all PDF files from a folder, compresses them using the default settings, and saves the compressed files to an output folder.

Here's the relevant code:


from google.colab import drive
drive.mount('/content/drive')



input_folder = '/content/drive/MyDrive/pdf_folder/'
output_folder = '/content/drive/MyDrive/Target Folder/'


!pip install 'PyPDF2<3.0'
!pip install PyMuPDF



import os
import glob
import fitz

def get_size_format(b, factor=1024, suffix="B"):
    """Scale a byte count to a human-readable string, e.g. 1253656 -> '1.20MB'."""
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if b < factor:
            return f"{b:.2f}{unit}{suffix}"
        b /= factor
    return f"{b:.2f}Y{suffix}"

def compress_file(input_file: str, output_file: str):
    """Compress PDF file"""
    if not output_file:
        output_file = input_file
    initial_size = os.path.getsize(input_file)
    try:
        doc = fitz.open(input_file)
        # Rewrite the PDF, deflating any uncompressed streams
        doc.save(output_file, deflate=True)
        doc.close()
    except Exception as e:
        print("Error compress_file=", e)
        return False
    compressed_size = os.path.getsize(output_file)
    ratio = 1 - (compressed_size / initial_size)
    summary = {
        "Input File": input_file, "Initial Size": get_size_format(initial_size),
        "Output File": output_file, "Compressed Size": get_size_format(compressed_size),
        "Compression Ratio": "{0:.3%}.".format(ratio)
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################\n\n")
    return True

if __name__ == "__main__":
    # input_folder = sys.argv[1]
    # output_folder = sys.argv[2]
  
    # Create output folder if it does not exist
    os.makedirs(output_folder, exist_ok=True)
    # Find all PDF files in input folder
    pdf_files = glob.glob(os.path.join(input_folder, "*.pdf"))
    # Compress each PDF file and save to output folder
    for pdf_file in pdf_files:
        output_file = os.path.join(output_folder, os.path.basename(pdf_file))
        compress_file(pdf_file, output_file)
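
The `glob` call above only matches PDFs directly inside `input_folder`. Since the goal is to cover every PDF in the Drive, a recursive search may be closer to what is needed. This is a minimal sketch using only the standard library; `find_pdfs` and `mirrored_output_path` are hypothetical helper names, not part of the original script.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root, searched recursively, sorted by path."""
    root = Path(root)
    # rglob("*") walks all subfolders. The suffix is compared in lower case
    # so that "REPORT.PDF" is found as well as "report.pdf".
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def mirrored_output_path(pdf_path, input_root, output_root):
    """Map an input PDF to the same relative location under output_root."""
    relative = Path(pdf_path).relative_to(input_root)
    return Path(output_root) / relative
```

In the main loop, `find_pdfs(input_folder)` could stand in for the `glob.glob(...)` call, with `mirrored_output_path(pdf_file, input_folder, output_folder)` giving the target path. The parent folder of each target would need to be created first (for example with `os.makedirs(..., exist_ok=True)`).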

The code works fine, but I would like to improve the compression quality. How can I achieve that? Are there any additional options or parameters I can pass to doc.save() to improve the compression quality? Or should I use a different library or approach altogether?

**NOTE: The PyPDF2 module is not an option here, because the method it used for compression is deprecated. I tried a lower version (2.12) and it still does not work, though you are welcome to try fixing it. I have included the whole code of the project, so feel free to copy it and use it in your own workflow.** I am using Google Colab for this project. Sample output:

## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/3-1 Regular .pdf
Initial Size:198.16KB
Output File:/content/drive/MyDrive/Target Folder/3-1 Regular .pdf
Compressed Size:197.20KB
Compression Ratio:0.487%.
###################################################################


## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/CN  ASSIGNMENT 2.pdf
Initial Size:2.37MB
Output File:/content/drive/MyDrive/Target Folder/CN  ASSIGNMENT 2.pdf
Compressed Size:2.37MB
Compression Ratio:-0.000%.
###################################################################


## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/PCS Answers.pdf
Initial Size:6.10MB
Output File:/content/drive/MyDrive/Target Folder/PCS Answers.pdf
Compressed Size:6.10MB
Compression Ratio:0.019%.
###################################################################


## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/NLP All units.pdf
Initial Size:6.55MB
Output File:/content/drive/MyDrive/Target Folder/NLP All units.pdf
Compressed Size:6.81MB
Compression Ratio:-3.929%.
###################################################################

As you can see, that is not enough compression. Please help me improve it. Any help or guidance would be greatly appreciated. Thank you!
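
In the sample output, `NLP All units.pdf` came out larger than the input (a ratio of -3.929%). One way to make sure the output folder never holds a file bigger than its source is to fall back to a plain copy in that case. The sketch below uses only the standard library; `keep_smaller` is a hypothetical helper name, and it assumes it is called after `compress_file` has written the output.

```python
import os
import shutil


def keep_smaller(original, candidate):
    """Keep candidate only if it is smaller than original.

    If candidate is not smaller, it is overwritten with a copy of original.
    Returns the fraction of space saved (0.0 when the original was kept).
    """
    # Nothing to compare when both paths point at the same file.
    if os.path.abspath(original) == os.path.abspath(candidate):
        return 0.0
    original_size = os.path.getsize(original)
    candidate_size = os.path.getsize(candidate)
    if original_size == 0 or candidate_size >= original_size:
        # The rewrite did not help, so store the untouched original instead.
        shutil.copyfile(original, candidate)
        return 0.0
    return 1 - candidate_size / original_size
```

Calling `keep_smaller(pdf_file, output_file)` right after `compress_file(pdf_file, output_file)` would leave the worst case at zero savings instead of a negative ratio.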

Martin Thoma
  • You can also try removing potential dead objects from the file by using garbage collection: `doc.save(filename, deflate=True, garbage=4)`. The highest option, 4, consolidates the XREF table and deletes unused and duplicate objects and object streams. That's about all that's possible. – Jorj McKie Apr 13 '23 at 10:28
  • `PyPDF2` is deprecated. Use `pypdf` – Martin Thoma Apr 15 '23 at 08:09
  • If your PDFs have embedded fonts (or font subsets), you could use the default fonts provided with PDF instead. – Joop Eggen Apr 15 '23 at 08:31
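
The `doc.save()` suggestion from the first comment could be applied to the script roughly as follows. This is a sketch, not a tested drop-in: `garbage=4` and `deflate=True` come from that comment, while `deflate_images` and `deflate_fonts` are additional `save()` options assumed to be available (they require a reasonably recent PyMuPDF, 1.18.3 or later). `SAVE_OPTIONS` and `compress_file_lossless` are hypothetical names. All of these options are lossless, so files made up mostly of already-compressed images may still shrink very little.

```python
# Lossless save options. garbage=4 and deflate=True come from the comment
# above; deflate_images and deflate_fonts are additions that assume
# PyMuPDF 1.18.3 or later.
SAVE_OPTIONS = {
    "garbage": 4,            # drop unused/duplicate objects, compact the XREF table
    "deflate": True,         # compress uncompressed streams
    "deflate_images": True,  # compress uncompressed image streams
    "deflate_fonts": True,   # compress uncompressed font file streams
}


def compress_file_lossless(input_file, output_file, options=SAVE_OPTIONS):
    """Rewrite input_file to output_file using the given doc.save() options."""
    # PyMuPDF is imported here so that the pip install cell can run first.
    import fitz

    doc = fitz.open(input_file)
    try:
        doc.save(output_file, **options)
    finally:
        # Close the document even if saving fails.
        doc.close()
```

Keeping the options in a dictionary makes it easy to try variations (for example `garbage=3`) and compare the resulting sizes file by file.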
