How to parse text from a plain text file and use the result to highlight a PDF file

Question

Back in 2010, some guy claimed to be capable of doing this:

http://www.mobileread.com/forums/showthread.php?t=103847

"The Kindle stores its annotations in a Mobipocket (".mobi") file for each document and in one long text file named "My Clippings.txt." In this post I describe a system that synchronizes these annotations with PDF versions of the corresponding documents on a computer.

Overview

This system is embodied in an Applescript that parses the My Clippings file and controls the Skim PDF reader. The script first parses the clippings file. It then searches through the clippings and isolates any that come from documents on the kindle matching the filename of the currently open PDF file (the "pertinent clippings"). The script then iterates through each of the pertinent clippings, locating the matching text or location in the PDF document and applying highlights or adding notes where appropriate. The end result is an annotated, printable PDF document that matches the document on the kindle.

You can download the script here: http://dl.dropbox.com/u/2541109/KindleClippings.scpt. Before running the script, be sure to change the value of MyEmail to match your sending address and to verify that the Kindle mount point defined in MyClippingsFile is correct. You'll also need the free Skim PDF Reader.

To use it, send or copy a document file to your kindle. Remember, the kindle supports RTF, DOC, TXT and other common text formats and it will convert them into MobiPocket files internally for easier reading. Make some notes. Then take the same document that you just sent to the kindle and convert it to a PDF, e.g. by using the print to PDF feature in Mac OS X. Be sure to keep the filename the same. Open that same PDF in Skim and run the script. The highlights and notes should appear in the PDF.

If you're interested in how this works, read more on my blog here: [not longer available]

Sadly, his script is no longer available, nor his blog.

Do you guys know if this is possible? I've been looking for this kind of functionality but can't find it anywhere.

score 1 · Accepted Answer · answered Jul 06 '22 at 12:09

This code, using python and PyMuPDF, works:

import fitz

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text", 
    "second piece of text",
    "third piece of text"
        ]

for page in doc:
    for text in text_list:
        rl = page.search_for(text, quads = True)
        page.add_highlight_annot(rl)

# save to a new PDF
doc.save("text_annotated.pdf")

The original 'My Clippings.txt' should be manipulated somehow, stringr could work but I found more useful to manipulate the text with multiple selections in Sublime Text---the goal is to have a list of highlights in the form of text_list above.

score 0 · Answer 2 · edited Mar 20 '17 at 10:04

I am trying to do this using Python + a Windows macro creator (I'm a Win 7 user). You can use this approach to save the file as RTF, DOCX, PDF, etc. So far, it's been reasonably effective. Do note 2 things first:

1- the 'My Clippings' file only saves the text and the page, it does not save the location on the page (e.g., if you highlighted "mammals are animals" on page 15, it will give you this line and the page number, but if there are more than one "mammals are animals" on page 15, it's impossible to know which one you've highlighted). This is specially bad when you've highlighted a generic word, like "animals" or "the". And if you made comments by pressing on a word, this word is the only information you'll get about what in that page the comment refers to (e.g., I pressed on "animals" and the menu popped up, I selected 'Comment'. If "animals" appears 20 times on page 15, I cannot know to which of them my comment is refering).

2- The only way to retrieve the location on the page would be to analyze the *.pds and *.pdt files, inside the *.sdr folder in Kindle's drive ('Documents'). I can make no sense of these files.

In Python, you can run an easy code to extract the information you want from "My Clippings". Then you can use a macro creator to automate the process of copying the text and annotating it to the PDF (using Adobe Acrobat, for example), and then saving the PDF file.

Exemplifying with Adobe Acrobat:

Say I want to save all my highlights to the PDF file. First, I'll create a *.txt file on Python and run a script to copy all the strings related to the highlights to this new txt file (i.e., the highlighted text & the page number). Here's an example of such code (but first, copy and paste the "My Clippings.txt" file to the IDE start folder, e.g.: C:\Python27):

#for python 2.7.6
with open('My Clippings.txt','r') as rf:
    with open('My Clippings Output.txt','w') as wf:
        access = 0
        bookTitle = 'Book Title'#put the book file's name as it's written in "My Clippings.txt"
        for x in rf:
            if access == 1:
                wf.write(x)
            if bookTitle in x: 
                access = 1
            #for highlights only, instead of all annotations, include this if statement:
            if (' | Added on ' in x) and ('- Your Note ' in x) or ('- Your Bookmark ' in x):
                access = 0
            if x == '==========\n':
                access = 0

Then I'll create a macro to copy the page number in the "My Clippings Output.txt" file (it's inside the same folder you put the "My Clippings.txt" file), paste in Acrobat "page window", find (ctrl+f) the string in the page, then press "highlight". Done!

There's a catch in Acrobat though, the search/find function has a limit of ~28 chars, so your highlighted text can't be longer than that. I still don't know how to circumvent this limitation... I raised this problem here https://superuser.com/questions/884221/how-to-search-and-highlight-long-passages-in-a-pdf-file . As a bypass to the 28 chars limit on Acrobat, you can program the macro to copy using "shift"+"right arrow 28 times", and then use "cut" instead of "copy".

There are many free-to-use and libre macro creators out there, just google and choose the one you like best. For Windows, my favorite one is Pulover's Macro Creator. If you have any doubts about the process you can comment here or PM me. I'd prefer you to comment here, so that I can improve the answer

This doesn't seem to answer the question. You say you're doing what the OP wants to do using Python and macros, but you don't provide any Python or macro code, just a list of warnings. — Jeffrey Bosboom, Mar 02 '15 at 01:54
Well, he asks: "Do you guys know if this is possible? I've been looking for this kind of functionality but can't find it anywhere. " I can provide the step-by-step process if he asks this, but the process is so simple to implement that I don't know if it's worth it to write it (the answer would become much, much bigger). By the way, the warnings are related to the feasibility of the process - as I said, it's impossible to get the page location using only the 'My Clippings' file. If the guy from Mobileread promised this he is lying — flen, Mar 02 '15 at 02:05

How to parse text from a plain text file and use the result to highlight a PDF file

2 Answers2

Linked