How do I resolve this error in saving a new odt after a regex?

Question

I've been trying to find good documentation to solve this, but from what little I could find, this code should have worked. I'm curious why it isn't working, but I'm certainly not an expert.

>>> import sys
>>> import re
>>> from odf.opendocument import load
>>> from odf import text, teletype
>>> infile = load(r'C:\Users\Iainc\Documents\The Seventh Story.odt')
>>> for item in infile.getElementsByType(text.P):
...     s = teletype.extractText(item)
...     m = re.sub(r'\[\((?:(?!\[\().)*?\)\]', '', s);
...     if m != s:
...             new_item = text.P()
...             new_item.setAttribute('stylename', item.getAttribute('stylename'))
...             new_item.addText(m)
...             item.parentNode.insertBefore(new_item, item)
...             item.parentNode.removeChild(item)
... infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt')
  File "<stdin>", line 10
    infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt')
    ^^^^^^
SyntaxError: invalid syntax

This is supposed to go through a document full of nested notes (e.g., "[(blah blah [(blah [(blah (blah) blah)] )] blah )]") and remove all of them, leaving only the text before the first "[(" or after the last ")]". As far as I can tell this code should do that, so why the error? I'm also not certain the filter itself is working as it should.

Why did you put `;` at the end of the `m = re.sub(r'\[\((?:(?!\[\().)*?\)\]', '', s);` line? Remove it. — Wiktor Stribiżew, Jan 08 '22 at 22:37
Okay, I just did! And if I place a line before "infile.save", the last line, then it does run ... however, my regex catches ALMOST all of the "[(" and ")]", reducing it from about 4600 to about 90 ... does my regex not catch it ... ah, because there is a ")]", then a line break, and then directly below it, a "[(" ... if I change it from "." to "[\s\S]", that would solve that...? — Iain Curtis-Shanley, Jan 09 '22 at 04:13
It will solve that. Or just use `re.S` or `re.DOTALL` option. To remove all nested occurrences, run in a loop until there is no match. — Wiktor Stribiżew, Jan 09 '22 at 11:52
I tried both "re.S" and "re.DOTALL". Strangely, the exact same number of these things remain, and in the exact same position. — Iain Curtis-Shanley, Jan 09 '22 at 20:42

score 0 · Answer 1 · answered Jan 09 '22 at 10:40

I don't know why you are getting the SyntaxError, but to remove all the notes while leaving the text between each group of nested notes, re.sub will probably need to be called repeatedly in a loop.
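As for the SyntaxError, a likely cause (consistent with the asker's comment that adding a line before infile.save makes it run) is the interactive prompt itself: at >>>, a compound statement such as a for loop has to be closed with a blank line before the next top-level statement is typed. The trailing ; is legal, if unidiomatic. Below is a minimal sketch that reproduces the difference with the built-in compile(); the sample source string and the compiles helper are hypothetical, not taken from the question.

```python
# Hypothetical reproduction: a for loop followed directly by another statement,
# with no blank line in between, as typed at the >>> prompt in the question.
source = "for i in range(3):\n    pass\nprint('done')\n"


def compiles(src, mode):
    """Return True if src compiles in the given mode, False on SyntaxError."""
    try:
        compile(src, "<stdin>", mode)
        return True
    except SyntaxError:
        return False


# "single" compiles one interactive statement, like the >>> prompt does.
as_prompt_input = compiles(source, "single")
# "exec" compiles a whole module, like running a .py file.
as_script = compiles(source, "exec")
print(as_prompt_input, as_script)
```

Saved to a .py file and run as a script, the original code would not hit this particular error.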

Your regex matches from [( to the first occurrence of )] that follows it, but not if [( appears again between them. This has the effect of matching the innermost note of each group of nested notes, which is then replaced with the empty string to remove it.
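To see the innermost-first behaviour, here is a small sketch that records the string after each pass. The sample string and the names NOTE and passes are made up for illustration.

```python
import re

# Same pattern as in the question, compiled once.
NOTE = re.compile(r'\[\((?:(?!\[\().)*?\)\]', flags=re.DOTALL)

sample = 'a [(one [(two [(three)] two)] one)] b'

passes = []          # the string after each pass that changed something
current = sample
while True:
    stripped = NOTE.sub('', current)
    if stripped == current:   # nothing left to remove
        break
    passes.append(stripped)
    current = stripped

for p in passes:
    print(repr(p))
# 'a [(one [(two  two)] one)] b'
# 'a [(one  one)] b'
# 'a  b'
```

Each pass peels off one level of nesting, so a single re.sub call leaves the outer levels behind.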

To match across line endings you're going to need the re.DOTALL flag or to put (?s) at the start of the regex, or to use a match-any-character class like [\S\s] instead of .
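A quick sketch comparing the three options on a note that spans a line break (the sample string is hypothetical):

```python
import re

sample = 'before [(line one\nline two)] after'
base = r'\[\((?:(?!\[\().)*?\)\]'

# Without a flag, "." stops at the newline, so nothing is removed.
no_flag = re.sub(base, '', sample)

# Three equivalent ways to let the pattern cross line endings.
with_flag = re.sub(base, '', sample, flags=re.DOTALL)
inline = re.sub(r'(?s)' + base, '', sample)
char_class = re.sub(r'\[\((?:(?!\[\()[\s\S])*?\)\]', '', sample)

print(repr(no_flag))     # 'before [(line one\nline two)] after'
print(repr(with_flag))   # 'before  after'
```

All three variants should give the same result; which one to use is mostly a matter of taste.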

For example:

import re

text = '''
beginning [(blah blah [(blah [(blah (blah) blah)] )] blah 
blah (blah) blah blah )] middle [(blah blah [(blah [(blah
(blah) blah)] )] blah blah (blah) blah blah )] end
'''

t = ''
# Repeat until a pass changes nothing: each pass removes only the innermost notes.
while t != text:
    t = text
    text = re.sub(r'\[\((?:(?!\[\().)*?\)\]', '', text, flags=re.DOTALL)

print(text)
# beginning  middle  end
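Two things to watch when applying this to the document rather than to a plain string. First, re.sub only accepts a str (or bytes), so it has to be given the extracted text, for instance the result of teletype.extractText(item), and not the element itself. Second, and this is only a guess since the thread does not confirm it: if a note opens in one paragraph and closes in a later one, no per-paragraph regex can match it, whatever flags are used. The sketch below illustrates that with plain strings; strip_notes and the paragraphs list are hypothetical stand-ins for the extracted paragraph texts.

```python
import re

NOTE = re.compile(r'\[\((?:(?!\[\().)*?\)\]', flags=re.DOTALL)


def strip_notes(s):
    """Remove [( ... )] notes, innermost first, until none are left."""
    while True:
        s, count = NOTE.subn('', s)
        if count == 0:
            return s


# Hypothetical paragraph texts, standing in for teletype.extractText(item).
paragraphs = ['kept one [(note opens here', 'and closes here)] kept two']

# Processing each paragraph on its own never sees a complete note.
per_paragraph = [strip_notes(p) for p in paragraphs]

# Joining the paragraphs first lets the pattern span the paragraph boundary.
joined = strip_notes('\n'.join(paragraphs))

print(per_paragraph)
print(repr(joined))   # 'kept one  kept two'
```

If this is what is happening, the text would need to be cleaned across paragraphs before being written back, which also means deciding how to split the cleaned text into paragraphs again.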

When I run your method, I get "Traceback (most recent call last): File "<stdin>", line 3, in <module> File "C:\Python310\lib\re.py", line 209, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or bytes-like object". — Iain Curtis-Shanley, Jan 09 '22 at 20:38
And if I just take my original code, but put in "flags=re.DOTALL" ... the same number of "[(" notes (ie, "[( blah blah)]"), and in the same position (directly after a line break), still exist ... — Iain Curtis-Shanley, Jan 09 '22 at 20:39
Yes, you seem to get some strange errors, don't you, as I can see from your question. The simple code above works fine, as can easily be tested by trying it in one of the many online Python IDEs. — MikeM, Jan 10 '22 at 00:42
