Regex not specific enough

Question

So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would be random periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.

This is the code for the regex that I'm using to detect the book's title and author:

titleRegex = re.compile(r'(.+)\((.+)\)')

Example

Desired book title and author match: Book title (Author name)
What would also get matched: *I like apples because they are green (they are sometimes red as well).*

In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted.
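A minimal reproduction of the problem, using only the two example strings above and the same pattern and replacement as the full program:

```python
import re

# Same pattern as titleRegex in the program
title_regex = re.compile(r'(.+)\((.+)\)')

header = 'Book title (Author name)'
highlight = 'I like apples because they are green (they are sometimes red as well). '

# The program replaces every match with a single space
cleaned_header = title_regex.sub(' ', header)
cleaned_highlight = title_regex.sub(' ', highlight)

print(repr(cleaned_header))     # ' '
print(repr(cleaned_highlight))  # ' . '
```

Both lines match because the pattern only asks for "some text, then something in parentheses", so only the trailing period of the highlight survives.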

Here is the unformatted text file that goes into my program

The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.

Is there any way to make my title regex more specific so that it only picks up titles and authors and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?

I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)

import re
titleRegex = re.compile(r'(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')

regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]

with open("/Users/devinnagami/myclippings.txt") as f:
    string = f.read()

# Replace every match of each pattern with a single space, one pattern at a time
newString = string
for regex in regexList:
    newString = regex.sub(' ', newString)

finalText = newString.split('             ')

with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)

What distinguishes *I like apples because they are green* from a valid book title? (I'd have to say "nothing whatsoever".) What distinguishes *they are sometimes red as well* from a valid author name? If you can't come up with a foolproof rule for at least one of those, then your problem is not solvable. — jasonharper, Jul 22 '21 at 23:23
You could do a multiline match from "^=====" through the second newline. — Tim Roberts, Jul 22 '21 at 23:25

score 0 · Accepted Answer · answered Jul 23 '21 at 00:58

There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.

For instance:

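import re

# One pattern for a whole clipping: separator, title line, metadata line, blank line, quote
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>\d+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)

# data is the full text of the clippings file
with open("/Users/devinnagami/myclippings.txt") as f:
    data = f.read()

for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())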

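This matches each clipping as a whole and parses it in one pass, relying on the fact that the book title is the first line after the row of equals signs and the quote itself comes after the metadata line. Each field is then available by name through m.groupdict(), so you can keep only the quote and discard the rest.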
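A variation on the same idea, as a rough sketch: split the file on the separator first, then match each entry on its own and keep only the quote. The sample text below is an assumption about the file layout (the real file is not shown in the question), and ENTRY_RE and extract_quotes are hypothetical names made up for this example.

```python
import re

# Assumed layout of the clippings file; adjust the pattern if yours differs
SAMPLE = (
    "\ufeffBook title (Author name)\n"
    "- Your Highlight on page 12 | Location 180-181 | Added on Thursday, July 22, 2021 11:23:45 PM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 15 | Location 230-231 | Added on Thursday, July 22, 2021 11:25:02 PM\n"
    "\n"
    "A second highlight.\n"
    "==========\n"
)

# Title line, metadata line, blank line, then everything else is the quote
ENTRY_RE = re.compile(
    r"(?P<title>[^\n]*?) \((?P<author>[^()\n]*)\)\n"
    r"- Your (?:Highlight|Note|Bookmark)[^\n]*\n"
    r"\n"
    r"(?P<quote>[\s\S]*)"
)

def extract_quotes(text):
    """Return only the highlighted text of each entry."""
    quotes = []
    # Drop byte-order marks, then handle one entry at a time
    for chunk in text.replace("\ufeff", "").split("=========="):
        m = ENTRY_RE.fullmatch(chunk.strip())
        if m:  # entries with no text (or an unexpected layout) are skipped
            quotes.append(m.group("quote").strip())
    return quotes

quotes = extract_quotes(SAMPLE)
for q in quotes:
    print(q)
```

Because the title pattern is only ever tried on the first line of an entry, a highlight that happens to end in parentheses is left alone.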
