Redacted / highlighted PDF becomes too big with this script. Can it be improved?

Question

A few years ago I asked this question. I wanted to extract my Kindle annotations from the MyClippings.txt file and use them to annotate a PDF version of the original text. Very useful for academic reading (e.g., having the annotated original PDF is more useful for skimming and citing). A few months ago I found a solution in the following script.

import fitz

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text", 
    "second piece of text",
    "third piece of text"
        ]

for page in doc:
    for text in text_list:
        rl = page.search_for(text, quads = True)
        page.add_highlight_annot(rl)

# save to a new PDF
doc.save("text_annotated.pdf")

Since then, however, I have found a new problem. On a 700-page book, the output PDF becomes extremely large (more than 500 MB). (The script had to be run several times, because it would crash when given all the annotations at once; this is not necessarily a problem, but it suggests inefficiency.) Is there an approach, presumably Python-based, that would avoid such an inefficient outcome?

Yes, I think I am duplicating the add-ons, that is what causes the final PDF to be so big. Consider I am trying to transfer highlighted texts from EPUBs into PDFs, at least 300 lines inside `text_list` in the example above — Ramiro, May 20 '23 at 02:29

Life is complex · Answer 1 · 2023-05-24T11:36:51.767

2

There are several unknowns in your question:

How big is the original PDF? KB? MB?
What language is the book?
How many annotations are you adding?
What is the complexity of the items in your text list?

To fully diagnose your issue it would be useful to have access to the PDF and a couple of search items from your text list. There is a possibility that your text searches are too broad and might require using something like blocks = page.get_text("dict", flags=flags)["blocks"] or something else.
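
One way to check for overly broad searches before touching the PDF is to audit the list itself. The sketch below rests on an assumption about what "too broad" means: it treats any string with fewer than `min_words` words as suspicious, and it drops exact duplicates after collapsing whitespace. The helper name `audit_search_strings` and the threshold are hypothetical and not part of PyMuPDF.

```python
def audit_search_strings(strings, min_words=3):
    """Split search strings into (usable, too_broad).

    Assumption: a string with fewer than `min_words` words is likely to
    match in many unrelated places. Exact duplicates are dropped.
    """
    usable, too_broad = [], []
    seen = set()
    for raw in strings:
        text = " ".join(raw.split())  # collapse whitespace and newlines
        if not text or text in seen:
            continue
        seen.add(text)
        if len(text.split()) < min_words:
            too_broad.append(text)
        else:
            usable.append(text)
    return usable, too_broad


usable, too_broad = audit_search_strings([
    "first piece of text",
    "the",
    "first  piece of text",   # duplicate once whitespace is collapsed
    "second piece\nof text",
])
print(usable)     # ['first piece of text', 'second piece of text']
print(too_broad)  # ['the']
```

Only the `usable` list would then be passed to the search loop; the `too_broad` list is worth reading by eye before deciding what to do with it.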

Below is rough code that might help.

Note that I used """Triple Quotes""", because your text list might be causing your crashes. The list might contain multi-line strings and strings that contain quotes themselves.

I also believe it would be beneficial to do some housekeeping in your document save call by using these parameters:

garbage=4 -> garbage collect unused objects, compact xref tables and merge duplicate objects and streams
deflate=True -> deflate uncompressed streams
clean=True -> clean and sanitize content streams


import fitz  # PyMuPDF

doc = fitz.open("text_to_highlight.pdf")

text_list = [
    """first piece of text""",
    """second piece of text""",
    """third piece of text""",
]
try:
    for page in doc:
        for text in text_list:
            rl = page.search_for(text, quads=True)
            if not rl:
                # no match on this page; rl[0] would raise IndexError
                continue
            # highlight the first match found on this page
            page.add_highlight_annot(rl[0])
            print(f"Added {text} annots on page {page.number}.")
except Exception as e:
    print(e)
finally:
    # save with housekeeping options to keep the file size down
    doc.save("text_annotated.pdf", garbage=4, deflate=True, clean=True)

edited May 24 '23 at 11:36

answered May 20 '23 at 13:34

Life is complex


I will try this approach and share a suitable PDF file and long text lists to replicate the problem (I am sadly far from my files and computer for a few days). Will renew the bounty if needed. Thanks for your help! – Ramiro May 25 '23 at 14:58
Ok. Can you tell me the size of the original PDF document? – Life is complex May 25 '23 at 15:01
I am on a Chromebook-with-linux-machine for a few days, so my computing power is really bad. But I am running some tests. Interesting find: with an original PDF of 5.9MB the output of running the original code against three lines of text is 79.1MB. – Ramiro May 25 '23 at 15:47
Three annotations caused this much increase in the document size? – Life is complex May 25 '23 at 16:34
Weird. I will run a few tests with other docs to see if the pattern repeats itself. – Ramiro May 25 '23 at 17:39
The size of the file will increase, but there is no standard increase percentage. It can be substantial at times. – Life is complex May 26 '23 at 03:55

score 2 · Answer 2 · edited May 26 '23 at 11:14

One explanation (and it's hard to know without seeing a file) is that the repeated annotating is somehow duplicating objects within the PDF file. If you run cpdf -squeeze in.pdf -o out.pdf, that will coalesce any duplicated objects. If you can't provide the file, do post the output of cpdf and it might give useful information.

Here are the reference links for the cpdf binaries and the Python documentation.
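
For anyone scripting this from Python, the command above could be wrapped with `subprocess`. This is a minimal sketch that assumes the `cpdf` binary is installed and on `PATH`; the function names `squeeze_command` and `squeeze_pdf` are hypothetical, and only the `-squeeze` and `-o` options quoted above are used.

```python
import shutil
import subprocess


def squeeze_command(src, dst):
    """Build the argument list for: cpdf -squeeze src -o dst"""
    return ["cpdf", "-squeeze", src, "-o", dst]


def squeeze_pdf(src, dst):
    """Run cpdf if it is available; return True if it was run."""
    if shutil.which("cpdf") is None:
        # Assumption: cpdf must be on PATH; nothing is done otherwise.
        return False
    subprocess.run(squeeze_command(src, dst), check=True)
    return True


print(squeeze_command("in.pdf", "out.pdf"))
# ['cpdf', '-squeeze', 'in.pdf', '-o', 'out.pdf']
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.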

score 0 · Accepted Answer · answered May 29 '23 at 23:27

So, in case anybody gets here and is interested in this functionality, let me share the workflow and the code (slightly changed / improved from the one above, but basically the same). Useful when you've read in ePub but want to save your notes in a PDF for better skimming when doing research.

Purpose

To highlight a PDF using the MyClippings.txt file produced by the Kindle.

Steps

First, we need to extract from MyClippings the portions of text of the PDF we want to highlight. Fairly easy procedure, done manually. We can save the (rather long) lines in original_long_lines.txt.

This is not enough: we want to cut those long lines into bits of approximately five words (otherwise the PDF search function will not work properly). For that purpose we run the following code (check input_file and output_file and name them accordingly).

def break_lines(input_file, output_file):
    with open(input_file, 'r') as file:
        lines = file.readlines()

    output_lines = []
    for line in lines:
        words = line.split()
        if len(words) >= 3:
            # Break line into new lines with a maximum of five words
            # (the slice width matches the step, so chunks do not overlap)
            for i in range(0, len(words), 5):
                output_line = ' '.join(words[i:i+5])
                output_lines.append(output_line)

    with open(output_file, 'w') as file:
        file.write('\n'.join(output_lines))

    print(f"Output written to: {output_file}")


# Example usage
input_file = 'original_long_lines.txt'
output_file = 'shorter_lines.txt'
break_lines(input_file, output_file)
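
The chunking idea can be tried on an in-memory string before running it on a file. This is only a sketch: the helper name `chunk_words` is hypothetical, and it takes the step and the slice width as separate parameters to show how they interact. When the width is larger than the step, neighbouring chunks overlap and the shared words are repeated.

```python
def chunk_words(line, step=5, width=5):
    """Return chunks of up to `width` words, starting every `step` words."""
    words = line.split()
    return [" ".join(words[i:i + width]) for i in range(0, len(words), step)]


sample = "a b c d e f g h i j k l"

print(chunk_words(sample))
# ['a b c d e', 'f g h i j', 'k l']

print(chunk_words(sample, step=5, width=7))
# ['a b c d e f g', 'f g h i j k l', 'k l']
```

In the second call the words `f g` and `k l` appear in two chunks each, so the same stretch of text would be searched for, and possibly highlighted, twice.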

Second, that is not enough either: we want to get rid of the lines that hold only one or two words by joining them onto the preceding line (otherwise those one or two words would be highlighted everywhere they occur in the PDF). For that purpose, we use the following code:

def join_lines(input_file, output_file):
    with open(input_file, 'r') as file:
        lines = file.readlines()

    output_lines = []
    prev_line = ''

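    for line in lines:
        words = line.split()
        if len(words) <= 2:
            # Short line: glue it onto the line collected so far
            prev_line += ' ' + line.strip()
        else:
            # Flush the collected line (skip the empty initial value)
            if prev_line.strip():
                output_lines.append(prev_line.strip())
            prev_line = line.strip()

    # Add the last line to the output
    if prev_line.strip():
        output_lines.append(prev_line.strip())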

    with open(output_file, 'w') as file:
        file.write('\n'.join(output_lines))

    print(f"Output written to: {output_file}")


# Example usage
input_file = 'shorter_lines.txt'
output_file = 'shorter_lines_no_one_or_two_words.txt'
join_lines(input_file, output_file)
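
The merging step can also be checked on a small list in memory. The sketch below follows the same idea as `join_lines` but is not the same code: the name `merge_short_lines` and the `max_short` parameter are hypothetical, and blank lines are skipped rather than written out.

```python
def merge_short_lines(lines, max_short=2):
    """Append lines of `max_short` words or fewer to the preceding line."""
    merged = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if len(text.split()) <= max_short and merged:
            merged[-1] += " " + text
        else:
            merged.append(text)
    return merged


print(merge_short_lines(["a b c d e", "f g h i j", "k l"]))
# ['a b c d e', 'f g h i j k l']
```

The trailing two-word fragment ends up attached to the chunk before it, so it is never searched for on its own.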

And finally, we use the following code to highlight the PDF using our shorter_lines_no_one_or_two_words.txt text file.

import fitz  # PyMuPDF
from tqdm import tqdm

def highlight_pdf(pdf_path, text_file):
    # Load the list of strings from the text file
    with open(text_file, 'r') as file:
        search_strings = file.read().splitlines()

    # Open the PDF file
    pdf = fitz.open(pdf_path)

    # Initialize the progress bar
    progress_bar = tqdm(total=len(pdf), unit='page')

    for page_num in range(len(pdf)):
        page = pdf[page_num]
        for search_string in search_strings:
            if not search_string.strip():
                continue  # skip blank lines in the text file
            text_instances = page.search_for(search_string, quads=True)
            for inst in text_instances:
                # Highlight the found text
                page.add_highlight_annot(inst)

        # Update the progress bar after processing each page
        progress_bar.update(1)

    # Close the progress bar
    progress_bar.close()

    # Save the modified PDF
    output_path = 'highlighted_' + pdf_path
    pdf.save(output_path)
    pdf.close()

    print(f"Highlighted PDF saved as: {output_path}")


# Example usage
pdf_path = 'your_pdf.pdf'
text_file = 'shorter_lines_no_one_or_two_words.txt'
highlight_pdf(pdf_path, text_file)

In my experience, this sometimes increases the size of the final file dramatically, and sometimes it does not. This problem can easily be solved using cpdf, as mentioned by John Whitington above: cpdf -squeeze huge_pdf.pdf -o small_pdf.pdf. And now you have your Kindle highlights in your PDF.
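
Since several search strings can hit the same spot on a page, one possible way to limit repeated highlights is to skip any hit whose page and rectangle were already seen. Whether duplicates are what actually inflates the file is not established in this thread, so treat this as a sketch of the bookkeeping only. It uses plain tuples of coordinates; the names `hit_key` and `HighlightTracker` are hypothetical, and feeding it the bounding rectangle of each search hit is an assumption about how it would be wired in.

```python
def hit_key(page_number, coords, ndigits=1):
    """Build a hashable key from a page number and rectangle coordinates.

    Coordinates are rounded so that tiny floating point differences
    between two searches do not count as different rectangles.
    """
    return (page_number, tuple(round(c, ndigits) for c in coords))


class HighlightTracker:
    """Remember which (page, rectangle) pairs were already highlighted."""

    def __init__(self):
        self._seen = set()

    def is_new(self, page_number, coords):
        key = hit_key(page_number, coords)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


tracker = HighlightTracker()
hits = [
    (0, (72.0, 100.0, 300.0, 112.0)),
    (0, (72.0, 100.0, 300.0, 112.04)),  # same spot, tiny float noise
    (1, (72.0, 100.0, 300.0, 112.0)),   # same rectangle, different page
]
print([tracker.is_new(page, coords) for page, coords in hits])
# [True, False, True]
```

In the highlighting loop, the `is_new` check would sit just before the call that adds the annotation. The tracker only remembers hits from the current run, so it does not detect highlights left by an earlier run of the script.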

score -1 · Answer 4 · edited Dec 07 '22 at 16:10


Try this

import fitz

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text", 
    "second piece of text",
    "third piece of text"
    ]

for page in doc:
    for text in text_list:
        rl = page.search_for(text, quads = True)
        page.add_highlight_annot(rl)


doc.save("text_annotated.pdf")

edited Dec 07 '22 at 16:10

Paul Brennan


answered Dec 07 '22 at 14:54

Raj chaturvedi

