4

I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the below code I saw that paragraphs inserted in "track changes" mode return an empty Paragraph.text

import docx

doc = docx.Document('C:\\test track changes.docx')

for para in doc.paragraphs:
    print(para)
    print(para.text)

Is there a way to retrieve the text in revisioned inserts (w:ins elements) ?

I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7

Thanks

Jim Simson
  • 2,774
  • 3
  • 22
  • 30
yiftah
  • 71
  • 1
  • 6

5 Answers5

4

I was having the same problem for years (maybe as long as this question existed).

By looking at the code of "etienned" posted by @yiftah and the attributes of Paragraph, I have found a solution to retrieve the text after accepting the changes.

The trick was to get p._p.xml to get the XML of the paragraph and then using "etienned" code on that (i.e retrieving all the <w:t> elements from the XML code, which contains both regular runs and <w:ins> blocks).

Hope it can help the souls lost like I was:

from docx import Document

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML


WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"


def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        # Note: on older versions it is `tree.getiterator` instead of `tree.iter`
        return "".join(runs)
    else:
        return p.text


doc = Document("Hello.docx")

for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
2

Not directly using python-docx; there's no API support yet for tracked changes/revisions.

It's a pretty tricky job, which you'll discover if you search on the element names, perhaps 'open xml w:ins' for a start, that brings up this document as the first result: https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx

If I needed to do something like that in a pinch I'd get the body element using:

body = document._body._body

and then use XPath on that to return the elements I wanted, something vaguely like this aircode:

from docx.text.paragraph import Paragraph

inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)

You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.

opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html

scanny
  • 26,423
  • 5
  • 54
  • 80
1

the below code from Etienne worked for me, it's working directly with the document's xml (and not using python-docx)

http://etienned.github.io/posts/extract-text-from-word-docx-simply/

Community
  • 1
  • 1
yiftah
  • 71
  • 1
  • 6
0

I needed a quick solution to make text surrounded by "smart tags" visible to docx's text property, and found that the solution could also be adapted to make some tracked changes visible.

It uses lxml.etree.strip_tags to remove surrounding "smartTag" and "ins" tags, and promote the contents; and lxml.etree.strip_elements to remove the whole "del" elements.

def para2text(p, quiet=False):
    if not quiet:
        unsafeText = p.text
    lxml.etree.strip_tags(p._p, "{*}smartTag")
    lxml.etree.strip_elements(p._p, "{*}del")
    lxml.etree.strip_tags(p._p, "{*}ins")
    safeText = p.text
    if not quiet:
        if safeText != unsafeText:
            print()
            print('para2text: unsafe:')
            print(unsafeText)
            print('para2text: safe:')
            print(safeText)
            print()
    return safeText

docin = docx.Document(filePath)
for para in docin.paragraphs:
    text = para2text(para)

Beware that this only works for a subset of "tracked changes", but it might be the basis of a more general solution.

If you want to see the xml for a docx file directly: rename it as .zip, extract the "document.xml", and view it by dropping into chrome or your favourite viewer.

0

Here's an improvement over Jean-François T.'s solution that covers additional cases. It also reuses the already parsed XML structure rather than reparsing from string.

  • inserted, moved and deleted text
  • figure annotations (AlternateContent)
from docx.oxml.ns import qn, nsmap

def accepted_text(p):
    def _accepted_text(p):
        text = ''
        for run in p.xpath('w:r[not(w:pPr/w:rPr/w:moveFrom)] | w:ins/w:r'):
            for child in run:
                if child.tag == qn('w:t'):
                    text += child.text or ''
                elif child.tag == qn('w:tab'):
                    text += '\t'
                elif child.tag in (qn('w:br'), qn('w:cr')):
                    text += '\n'
                elif child.tag == qn('mc:AlternateContent'):
                    for nested_p in child.xpath('mc:Choice[1]//w:p', namespaces=nsmap):
                        text += _accepted_text(nested_p)
                        text += '\n'
        return text

    nsmap['mc'] = 'http://schemas.openxmlformats.org/markup-compatibility/2006'
    return _accepted_text(p._p)

Also referenced here: https://github.com/python-openxml/python-docx/issues/340#issuecomment-1473408122

caram
  • 1,494
  • 13
  • 21