I am trying to work with a ~300-page odt document. I know how to load documents in Python, at least in a basic way, but that didn't work for odt (it isn't a txt file). I researched this and installed the odfpy library, although it doesn't seem well documented. I've gotten as far as loading the document's paragraphs into a list, but I don't know how a regex would work across multiple list entries. So I tried converting the list to a string with "str()", and all I got was a long list of object addresses.

I want to load an odt document and run a regex to remove a certain pattern from it. How do I go about doing this? So far, what I've tried doesn't work. I'd like to keep the structure of the odt intact. I'm more used to txt.

import sys
import re
from odf.opendocument import load
from odf import text, teletype
infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
allparas = infile.getElementsByType(text.P)
stringallparas = str(allparas)

This is what I have so far that, I believe, works. But certain things that would work with .txt aren't working.

1 Answer

Something like the following might work. Replace 'Your pattern here' with the regex pattern to replace.

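import re
from odf.opendocument import load
from odf import text, teletype

infile = load(r'C:\Users\Iainc\Documents\Blah Blah.odt')
# Iterate over a copy in case the document updates its element list
# while paragraphs are being replaced.
for item in list(infile.getElementsByType(text.P)):
    # Flatten the paragraph (including nested spans) to a plain string.
    s = teletype.extractText(item)
    m = re.sub(r'Your pattern here', '', s)
    if m != s:
        # Build a replacement paragraph. It holds plain text only, so
        # inline formatting inside the old paragraph is not carried over.
        new_item = text.P()
        style = item.getAttribute('stylename')
        if style:
            new_item.setAttribute('stylename', style)
        new_item.addText(m)
        item.parentNode.insertBefore(new_item, item)
        item.parentNode.removeChild(item)

infile.save('result.odt')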

The for loop in this code was taken from ReplaceOneTextToAnother on the odfpy wiki and slightly modified to use re.sub instead of str.replace and text.P instead of text.Span.
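
Because the loop calls re.sub on each paragraph's text separately, a match can never start in one paragraph and end in the next. The following is a minimal sketch of that behaviour using plain strings in place of the extracted paragraph text, so it runs without odfpy. The pattern and the helper name `changed_paragraphs` are assumptions made up for illustration, not part of odfpy.

```python
import re

# Illustrative pattern (an assumption): remove anything wrapped in "[(" ... ")]".
PATTERN = re.compile(r"\[\(.*?\)\]")


def changed_paragraphs(paragraph_texts, pattern=PATTERN):
    """Return (index, new_text) for each paragraph the pattern modifies.

    Each string is processed on its own, mirroring the per-paragraph
    re.sub call in the loop above.
    """
    changes = []
    for index, original in enumerate(paragraph_texts):
        cleaned = pattern.sub("", original)
        if cleaned != original:
            changes.append((index, cleaned))
    return changes


# Plain strings standing in for teletype.extractText(item) results.
paragraphs = [
    "Keep this [(drop this)] and this.",
    "Nothing to remove here.",
    "An annotation [(that starts here",
    "and ends in the next paragraph)] is not matched.",
]

for index, cleaned in changed_paragraphs(paragraphs):
    print(index, repr(cleaned))
```

Only the first paragraph is reported as changed. The span that opens in the third string and closes in the fourth is left alone, since neither string contains a complete match on its own.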

Nathan Mills
  • When I do this, it returns an error, though: `infile.save(r'C:\Users\Iainc\Documents\The Seventh Story 2.odt')` raises `SyntaxError: invalid syntax`, reported at `File "", line 10`. Why is that? – Iain Curtis-Shanley Jan 08 '22 at 15:31
  • I added a blank line before the `save` call to fix the SyntaxError, so the code should work now. I think this error happens because Python's REPL (read-eval-print loop) still expects an indented line at that point. A blank line tells Python that the current indented block has ended. See the following link for an explanation of why the blank line is necessary: [Why am I getting an invalid syntax error in Python REPL right after IF statement?](https://stackoverflow.com/a/50901962/8890345) – Nathan Mills Jan 08 '22 at 21:24
  • Yes, now it does run. However, my regex catches ALMOST all of the "[(" and ")]", reducing them from about 4600 to about 90, because there is a ")]", then a line break, and then directly below it, a "[(". If I change it from "." to "[\s\S]", would that solve it? No, it didn't reduce the count at all beyond the original version. Why? Isn't that supposed to take care of the remaining ones? – Iain Curtis-Shanley Jan 09 '22 at 04:21
  • One other addition to the above-mentioned issue: there are 3603 cases of "[(" in the document, but only 3601 cases of ")]". That means that twice, presumably, I forgot to close a "[(" with a ")]". Is that going to wreak havoc with the code and cause it to delete things it isn't supposed to? – Iain Curtis-Shanley Jan 09 '22 at 04:39
  • You can use the `re.DOTALL` flag to make `.` match newlines. Something like `re.sub(r'Your pattern here', '', s, count=0, flags=re.DOTALL)`. A count of zero (the default) replaces all occurrences. You might want to check the document and make sure all the brackets are matched. The code might delete stuff it isn't supposed to if the brackets aren't matched, so you should probably make a backup of the document before running the code. – Nathan Mills Jan 09 '22 at 21:29
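
The two points in the last comment can be tried out on plain strings before touching the real document. The sketch below is only an illustration: the non-greedy pattern is a guess at the kind of regex being used, `strip_markers` and `find_unclosed` are hypothetical helper names, and it assumes a line break inside a paragraph shows up as a newline character in the extracted text. It also assumes the markers are never nested on purpose.

```python
import re

OPEN, CLOSE = "[(", ")]"
# Illustrative pattern (an assumption): non-greedy, so each match ends at the nearest ")]".
PATTERN = r"\[\(.*?\)\]"


def strip_markers(text, dotall=False):
    """Remove every "[(" ... ")]" span; with dotall=True, "." also matches newlines."""
    flags = re.DOTALL if dotall else 0
    return re.sub(PATTERN, "", text, count=0, flags=flags)


def find_unclosed(text):
    """Return offsets of "[(" markers with no ")]" before the next "[(" or the end.

    Assumes markers are never nested on purpose.
    """
    problems = []
    open_at = None
    i = 0
    while i < len(text):
        if text.startswith(OPEN, i):
            if open_at is not None:
                # A second "[(" arrived before the previous one was closed.
                problems.append(open_at)
            open_at = i
            i += len(OPEN)
        elif text.startswith(CLOSE, i):
            open_at = None
            i += len(CLOSE)
        else:
            i += 1
    if open_at is not None:
        problems.append(open_at)
    return problems


multiline = "x [(line one\nline two)] y"
print(repr(strip_markers(multiline)))               # unchanged: "." stops at the newline
print(repr(strip_markers(multiline, dotall=True)))  # the whole span is removed

unbalanced = "a [(one)] b [(two c [(three)] d"
print(unbalanced.count(OPEN), unbalanced.count(CLOSE))  # the two counts differ
print(find_unclosed(unbalanced))        # offsets of the "[(" markers left open
print(repr(strip_markers(unbalanced)))  # "two c " is deleted along with the annotations
```

Running something like `find_unclosed` over each paragraph's extracted text first would point at the spots to fix by hand, which is safer than letting the substitution swallow text between an unclosed "[(" and the next ")]".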