Main goal: I want to write a script that reads all the files in a Google Drive, identifies the PDFs, and compresses each PDF so that it takes up less space. Below is how far I have got.
I have a Python script that uses the PyMuPDF library to compress PDF files. The script reads all PDF files from a folder, compresses them using the default settings, and saves the compressed files to an output folder.
Here's the relevant code:
from google.colab import drive
drive.mount('/content/drive')
input_folder = '/content/drive/MyDrive/pdf_folder/'
output_folder = '/content/drive/MyDrive/Target Folder/'
!pip install 'PyPDF2<3.0'
!pip install PyMuPDF
import os
import glob
import fitz
def get_size_format(b, factor=1024, suffix="B"):
    """Scale a byte count to a human-readable string with a unit prefix."""
    for unit in ["", "K", "M", "G", "T", "P", "E", "Z"]:
        if b < factor:
            return f"{b:.2f}{unit}{suffix}"
        b /= factor
    return f"{b:.2f}Y{suffix}"
def compress_file(input_file: str, output_file: str):
    """Compress PDF file"""
    if not output_file:
        output_file = input_file
    initial_size = os.path.getsize(input_file)
    try:
        doc = fitz.open(input_file)
        # Re-save with stream compression (deflate) enabled;
        # every other save option is left at its default
        doc.save(output_file, deflate=True)
        doc.close()
    except Exception as e:
        print("Error compress_file=", e)
        return False
    compressed_size = os.path.getsize(output_file)
    # Fraction of the original size that was saved (negative if the file grew)
    ratio = 1 - (compressed_size / initial_size)
    summary = {
        "Input File": input_file, "Initial Size": get_size_format(initial_size),
        "Output File": output_file, "Compressed Size": get_size_format(compressed_size),
        "Compression Ratio": "{0:.3%}.".format(ratio)
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################\n\n")
    return True
if __name__ == "__main__":
    # input_folder = sys.argv[1]
    # output_folder = sys.argv[2]
    # Create output folder if it does not exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    # Find all PDF files in input folder
    pdf_files = glob.glob(os.path.join(input_folder, "*.pdf"))
    # Compress each PDF file and save to output folder
    for pdf_file in pdf_files:
        output_file = os.path.join(output_folder, os.path.basename(pdf_file))
        compress_file(pdf_file, output_file)
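
Since the end goal is the whole Drive rather than a single folder, the glob step could eventually be replaced by a recursive walk. The following is only a minimal sketch under that assumption: find_pdfs is a hypothetical helper name that is not part of the script above, and it matches the extension case-insensitively.

```python
import os
from typing import Iterator


def find_pdfs(root: str) -> Iterator[str]:
    """Yield the path of every PDF below root, including subfolders.

    Hypothetical helper for illustration; not part of the script above.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Compare in lower case so that "scan.PDF" is found as well.
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
```

It could be used as `for pdf_file in find_pdfs(input_folder): ...`. If the walk starts at the Drive root, the output folder sits inside the tree being walked and would need to be skipped, and files with the same name in different subfolders would overwrite each other in a flat output folder.
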
The code runs without errors, but I would like to achieve better compression. How can I do that? Are there any additional options or parameters I can pass to doc.save() to reduce the file size further? Or should I use a different library or approach altogether?
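
For reference, Document.save() in PyMuPDF accepts more options than deflate, including garbage (0 to 4) and clean. The sketch below shows how they might be passed. The chosen values, the SAVE_OPTIONS name, and the compress_with_options helper are assumptions for illustration and have not been checked against the files in this question. These options mainly remove redundant objects and compress streams, so they may do little for PDFs whose size comes from embedded images.

```python
import os

# Keyword arguments for PyMuPDF's Document.save(). The values are an assumed
# starting point, not tuned settings.
SAVE_OPTIONS = {
    "garbage": 4,     # remove unused objects and merge duplicates (0 to 4)
    "clean": True,    # clean and sanitize content streams
    "deflate": True,  # compress streams that are stored uncompressed
}


def compress_with_options(input_file: str, output_file: str) -> int:
    """Save input_file to output_file with SAVE_OPTIONS; return bytes saved.

    Hypothetical helper. output_file must differ from input_file, because a
    plain (non-incremental) save cannot overwrite the file that is open.
    """
    import fitz  # PyMuPDF, imported here so the module loads without it

    doc = fitz.open(input_file)
    try:
        doc.save(output_file, **SAVE_OPTIONS)
    finally:
        doc.close()
    # A negative result means the saved file is larger than the original.
    return os.path.getsize(input_file) - os.path.getsize(output_file)
```
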
**NOTE: You cannot use the PyPDF2 module, because the method it uses for compression is deprecated. I tried a lower version (2.12) and it still does not work, though you are welcome to try fixing it. I have given the whole code of the project, so feel free to copy it and use it in your own workflow.**

I am using Google Colab for this project.

Sample output:
## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/3-1 Regular .pdf
Initial Size:198.16KB
Output File:/content/drive/MyDrive/Target Folder/3-1 Regular .pdf
Compressed Size:197.20KB
Compression Ratio:0.487%.
###################################################################
## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/CN ASSIGNMENT 2.pdf
Initial Size:2.37MB
Output File:/content/drive/MyDrive/Target Folder/CN ASSIGNMENT 2.pdf
Compressed Size:2.37MB
Compression Ratio:-0.000%.
###################################################################
## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/PCS Answers.pdf
Initial Size:6.10MB
Output File:/content/drive/MyDrive/Target Folder/PCS Answers.pdf
Compressed Size:6.10MB
Compression Ratio:0.019%.
###################################################################
## Summary ########################################################
Input File:/content/drive/MyDrive/pdf_folder/NLP All units.pdf
Initial Size:6.55MB
Output File:/content/drive/MyDrive/Target Folder/NLP All units.pdf
Compressed Size:6.81MB
Compression Ratio:-3.929%.
###################################################################
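
The last summary shows an output file that is larger than its input. A possible stopgap for that case is to compare the two sizes after saving and fall back to a plain copy of the original. This is a minimal sketch using only the standard library; keep_smaller is a hypothetical helper name, and it does not improve compression itself.

```python
import os
import shutil


def keep_smaller(input_file: str, output_file: str) -> bool:
    """Keep output_file only if it is smaller than input_file.

    Returns True if the compressed output was kept, False if it was
    replaced by a copy of the original. Hypothetical helper.
    """
    if os.path.getsize(output_file) < os.path.getsize(input_file):
        return True
    # The "compressed" file is not smaller, so store the original instead.
    shutil.copyfile(input_file, output_file)
    return False
```

It would be called right after a successful compress_file(pdf_file, output_file), so the target folder never holds a file bigger than its source.
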
As you can see, that is not enough compression, and one file even came out larger than the original. Any help or guidance on improving it would be greatly appreciated. Thank you!