Extract DOCX Comments

Question

I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I could download them as a zip and parse the XML.

The comments are tagged in w:comment tags, with w:t for the comment text and . It should be easy, but XML (etree) is killing me.

via the tutorial (and official Python docs):

z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)

Then I do this:

children = tree.getiterator()
for c in children:
    print(c.attrib)

Resulting in this:

{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}

And after this I am totally stuck. I've tried element.get() and element.findall() with no luck. Even when I copy/paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.

Can anyone help?

The text content of an element is in the `text` property. Does `print(c.text)` produce anything of interest? — mzjn, Nov 20 '17 at 12:41
`a = tree.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val') print(a.text)` Results in `AttributeError` `for c in children: print(c.text)` Results in the comments! Do you know how I would access the the other fields? — Andrew Saltz, Nov 20 '17 at 13:46

score 9 · Accepted Answer · answered Nov 20 '17 at 15:35

9

You got remarkably far considering that OOXML is such a complex format.

Here's some sample Python code showing how to access the comments of a DOCX file via XPath:

from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
  docxZip = zipfile.ZipFile(docxFileName)
  commentsXML = docxZip.read('word/comments.xml')
  et = etree.XML(commentsXML)
  comments = et.xpath('//w:comment',namespaces=ooXMLns)
  for c in comments:
    # attributes:
    print(c.xpath('@w:author',namespaces=ooXMLns))
    print(c.xpath('@w:date',namespaces=ooXMLns))
    # string value of the comment:
    print(c.xpath('string(.)',namespaces=ooXMLns))

answered Nov 20 '17 at 15:35

kjhughes

106,133
27
181
240

An answer and a compliment! Thank you! – Andrew Saltz Nov 20 '17 at 18:58
On getting the text the comment relates to: 1) the hardway with accepted xpath solution is to read `docxZip.read('word/document.xml')`, then parse it to find comment anchors (w:commentRangeStart) by the comment's id...so much work for a one off solution :-) 2) the easy way is to use the example with win32 api, since we are talking not about server deployment, but a local desktop script for personal use to parse few school essays... Coding time matters. The win32 example just does it. – Alex V Sep 08 '19 at 14:03

Shashank Shekhar Shukla · Answer 2 · 2021-05-27T18:56:44.753

Thank you @kjhughes for this amazing answer for extracting all the comments from the document file. I was facing same issue like others in this thread to get the text that the comment relates to. I took the code from @kjhughes as a base and try to solve this using python-docx. So here is my take at this.

Sample document.

I will extract the comment and the paragraph which it was referenced in the document.

from docx import Document
from lxml import etree
import zipfile
ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
#Function to extract all the comments of document(Same as accepted answer)
#Returns a dictionary with comment id as key and comment string as value
def get_document_comments(docxFileName):
    comments_dict={}
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment',namespaces=ooXMLns)
    for c in comments:
        comment=c.xpath('string(.)',namespaces=ooXMLns)
        comment_id=c.xpath('@w:id',namespaces=ooXMLns)[0]
        comments_dict[comment_id]=comment
    return comments_dict
#Function to fetch all the comments in a paragraph
def paragraph_comments(paragraph,comments_dict):
    comments=[]
    for run in paragraph.runs:
        comment_reference=run._r.xpath("./w:commentReference")
        if comment_reference:
            comment_id=comment_reference[0].xpath('@w:id',namespaces=ooXMLns)[0]
            comment=comments_dict[comment_id]
            comments.append(comment)
    return comments
#Function to fetch all comments with their referenced paragraph
#This will return list like this [{'Paragraph text': [comment 1,comment 2]}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict=get_document_comments(docxFileName)
    comments_with_their_reference_paragraph=[]
    for paragraph in document.paragraphs:  
        if comments_dict: 
            comments=paragraph_comments(paragraph,comments_dict)  
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph
if __name__=="__main__":
    document="test.docx"  #filepath for the input document
    print(comments_with_reference_paragraph(document))

Output for the sample document look like this

I have done this at a paragraph level. This could be done at a python-docx run level as well. Hopefully it will be of help.

Maitraye Das · Answer 3 · 2018-07-16T21:25:15.083

I used Word Object Model to extract comments with replies from a Word document. Documentation on Comments object can be found here. This documentation uses Visual Basic for Applications (VBA). But I was able to use the functions in Python with slight modifications. Only issue with Word Object Model is that I had to use win32com package from pywin32 which works fine on Windows PC, but I'm not sure if it will work on macOS.

Here's the sample code I used to extract comments with associated replies:

    import win32com.client as win32
    from win32com.client import constants

    word = win32.gencache.EnsureDispatch('Word.Application')
    word.Visible = False 
    filepath = "path\to\file.docx"

    def get_comments(filepath):
        doc = word.Documents.Open(filepath) 
        doc.Activate()
        activeDoc = word.ActiveDocument
        for c in activeDoc.Comments: 
            if c.Ancestor is None: #checking if this is a top-level comment
                print("Comment by: " + c.Author)
                print("Comment text: " + c.Range.Text) #text of the comment
                print("Regarding: " + c.Scope.Text) #text of the original document where the comment is anchored 
                if len(c.Replies)> 0: #if the comment has replies
                    print("Number of replies: " + str(len(c.Replies)))
                    for r in range(1, len(c.Replies)+1):
                        print("Reply by: " + c.Replies(r).Author)
                        print("Reply text: " + c.Replies(r).Range.Text) #text of the reply
        doc.Close()

This should be the real answer, as the previous one doesn't deal with the referenced text. — lucid_dreamer, Feb 26 '19 at 15:37
@lucid_dreamer: Word Automation requires Microsoft Word and is unreliable in server deployments. If you need the referenced text in a purely python solution that doesn't require Word, ask a new question. — kjhughes, Jul 25 '19 at 11:45

score 0 · Answer 4 · answered Jan 19 '23 at 08:45

If you want also the text the comments relates to :

def get_document_comments(docxFileName):
       comments_dict = {}
       comments_of_dict = {}
       docx_zip = zipfile.ZipFile(docxFileName)
       comments_xml = docx_zip.read('word/comments.xml')
       comments_of_xml = docx_zip.read('word/document.xml')
       et_comments = etree.XML(comments_xml)
       et_comments_of = etree.XML(comments_of_xml)
       comments = et_comments.xpath('//w:comment', namespaces=ooXMLns)
       comments_of = et_comments_of.xpath('//w:commentRangeStart', namespaces=ooXMLns)
       for c in comments:
          comment = c.xpath('string(.)', namespaces=ooXMLns)
          comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
          comments_dict[comment_id] = comment
       for c in comments_of:
          comments_of_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
          parts = et_comments_of.xpath(
            "//w:r[preceding-sibling::w:commentRangeStart[@w:id=" + comments_of_id + "] and following-sibling::w:commentRangeEnd[@w:id=" + comments_of_id + "]]",
            namespaces=ooXMLns)
          comment_of = ''
          for part in parts:
             comment_of += part.xpath('string(.)', namespaces=ooXMLns)
             comments_of_dict[comments_of_id] = comment_of
        return comments_dict, comments_of_dict

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Jan 21 '23 at 07:25

Extract DOCX Comments

4 Answers4

Linked