
I have a .docx file in which several paragraphs are marked as inserted, deleted, or modified with track changes.

python-docx does not return them when I use the `Document.paragraphs` property, as its docstring says:

@property
def paragraphs(self):
    """
    A list of |Paragraph| instances corresponding to the paragraphs in
    the document, in document order. Note that paragraphs within revision
    marks such as ``<w:ins>`` or ``<w:del>`` do not appear in this list.
    """
    return self._body.paragraphs

Is there a way to use this property and still get the revised paragraphs?

KarSho
  • No. The class documentation even literally says it does not support this, so why ask if you can *anyway*? You would need to add a *significant* amount of code to `python-docx`. – Jongware Jul 16 '18 at 12:42
  • Is there no workaround? I found https://stackoverflow.com/questions/38247251/how-to-extract-text-inserted-with-track-changes-in-python-docx but it is getting old, and I hope the situation has improved since then. –  Jul 16 '18 at 12:53
  • Updated solution here: https://stackoverflow.com/a/75766006/2893408 – caram Mar 17 '23 at 11:49

1 Answer


Borrowing from https://stackoverflow.com/a/56933021/10846829, we can parse each paragraph's underlying XML directly. Deleted text is stored in "delText" elements, inserted runs are wrapped in an "ins" element, and all other text can be read from the ordinary text element "t".
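To make the tag layout concrete, here is a minimal standard-library sketch that runs on a hand-written paragraph fragment rather than a real file. The `SAMPLE` string and the helper names `accepted_text` and `rejected_text` are made up for this illustration; they are not part of python-docx. Unlike the answer's code, the reject helper skips "ins" elements at any depth instead of only direct children of the paragraph.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}" notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical paragraph: "Hello ", then deleted "old", inserted "new", then " world".
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>old</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>new</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> world</w:t></w:r>'
    '</w:p>'
)


def accepted_text(paragraph):
    """Text after accepting all changes: keep every <w:t>, ignore <w:delText>."""
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))


def rejected_text(paragraph):
    """Text after rejecting all changes: skip <w:ins> subtrees, keep <w:delText>."""
    parts = []

    def walk(node):
        for child in node:
            if child.tag == W + "ins":
                continue  # inserted content disappears when changes are rejected
            if child.tag in (W + "t", W + "delText"):
                parts.append(child.text or "")
            walk(child)

    walk(paragraph)
    return "".join(parts)


p = ET.fromstring(SAMPLE)
print(accepted_text(p))  # Hello new world
print(rejected_text(p))  # Hello old world
```

With python-docx, the same helpers could presumably be fed `ET.fromstring(p._p.xml)` for each paragraph `p`, in the same way the functions below use `p._p.xml`.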

from docx import Document
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
from xml.etree.ElementTree import XML

word_body = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
text_tag  = word_body + "t"
del_tag   = word_body + "delText"
ins_tag   = word_body + "ins"

def accept_all(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Element.getiterator() was removed in Python 3.9; iter() is the replacement
        runs = (node.text for node in tree.iter(text_tag) if node.text)
        return "".join(runs)
    else:
        return p.text

def reject_all(p):
    """Return text of a paragraph after rejecting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # find and remove all insert tags
        for insert in tree.findall(ins_tag):
            tree.remove(insert)
        # find all deletion tags and change to text tags
        # (Element.getiterator() was removed in Python 3.9; iter() is the replacement)
        for deletion in tree.iter(del_tag):
            deletion.tag = text_tag
        runs = (node.text for node in tree.iter(text_tag) if node.text)
        return "".join(runs)
    else:
        return p.text

doc = Document("Hello.docx")

for p in doc.paragraphs:
    print("---")
    print(p.text)
    print("---")
    print(accept_all(p))
    print("=========")
    print(reject_all(p))
    print("=========")
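The loop above still depends on `doc.paragraphs`, which according to the docstring quoted in the question leaves out paragraphs that sit inside revision marks. One possible way around that is to bypass python-docx and read the main document part straight from the .docx zip archive. The sketch below uses only the standard library. It assumes the main part is stored at the conventional path `word/document.xml`, and `all_paragraph_texts` and `build_sample_docx` are hypothetical helper names. The sample archive holds just that one part, so it is not a file Word would open; it only exercises the reader.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}" notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def all_paragraph_texts(docx_file):
    """Return the accepted-changes text of every <w:p>, whatever element wraps it.

    Assumes the main part lives at word/document.xml. Paragraphs nested inside
    other paragraphs (e.g. text boxes) would also be counted in their parent.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    ]


def build_sample_docx():
    """Build an in-memory archive holding only a tiny, made-up document.xml."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        '<w:body>'
        '<w:p><w:r><w:t>plain</w:t></w:r></w:p>'
        '<w:ins w:id="1"><w:p><w:r><w:t>wrapped</w:t></w:r></w:p></w:ins>'
        '</w:body>'
        '</w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


texts = all_paragraph_texts(build_sample_docx())
print(texts)  # ['plain', 'wrapped']
```

For a real file, `all_paragraph_texts("Hello.docx")` should work the same way, since `zipfile.ZipFile` accepts a path as well as a file-like object.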
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – lemon Jul 10 '22 at 23:24