
Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?

I tried to find a keyword in all paragraphs in a document by iteration:

from docx import Document

doc = Document('test.docx')

for paragraph in doc.paragraphs:
    if '<DATE>' in paragraph.text:
        print('found date: ', paragraph.text)

It is found if placed in normal text, but not inside a textbox.

Stefan
  • In Word files, TextBoxes live in a separate object. From cursory googling around, `python-docx` has access to InlineShapes but not to TextBoxes. – Jongware Apr 27 '16 at 11:35

2 Answers

A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.

from docx import Document

doc = Document('test.docx')

# Walk every cell of every top-level table and check its paragraphs.
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if '<DATE>' in paragraph.text:
                    print('found date: ', paragraph.text)
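
Since the question also asks about manipulating the text, here is a minimal sketch of replacing the placeholder once it lives in a table. The helper name `replace_in_tables` is made up for illustration; it relies only on the `tables`, `rows`, `cells`, `paragraphs`, `runs` and `text` attributes that python-docx objects expose. It assumes the placeholder sits entirely inside one run, which is not guaranteed because Word sometimes splits text across several runs.

```python
def replace_in_tables(doc, old, new):
    """Replace `old` with `new` in the runs of all table cells.

    Returns the number of runs that were changed. Hypothetical helper,
    not part of python-docx.
    """
    replaced = 0
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    for run in paragraph.runs:
                        if old in run.text:
                            # Assigning to run.text keeps the run's
                            # character formatting (bold, colour, ...).
                            run.text = run.text.replace(old, new)
                            replaced += 1
    return replaced
```

With python-docx this could be called as `replace_in_tables(doc, '<DATE>', '2016-04-27')` followed by `doc.save('out.docx')`; the date and output filename are placeholders. Assigning to `paragraph.text` instead would also work, but it collapses the paragraph into a single run and drops run-level formatting.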
Stefan

Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:

# doc._body is a proxy object; the lxml element that supports XPath is doc.element.body
body = doc.element.body
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')

I have no idea whether textBox is the actual element name here; you'd have to sort that out along with the rest of the XPath details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
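
For reference, in the XML Word writes, the paragraphs of a text box sit inside a `w:txbxContent` element rather than `w:textBox`. The following is a sketch of the same idea using only the standard library instead of python-docx and lxml: it reads `word/document.xml` straight from the .docx zip archive. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in the {uri}tag form ElementTree expects.
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def text_box_paragraphs(document_xml):
    """Return the text of each paragraph inside a w:txbxContent element."""
    root = ET.fromstring(document_xml)
    texts = []
    for content in root.iter(W + 'txbxContent'):
        for p in content.iter(W + 'p'):
            # A paragraph's text is spread over one or more w:t elements.
            texts.append(''.join(t.text or '' for t in p.iter(W + 't')))
    return texts


def read_document_xml(path):
    """Read the main document part from a .docx file (a zip archive)."""
    with zipfile.ZipFile(path) as docx:
        return docx.read('word/document.xml')
```

Usage would look like `text_box_paragraphs(read_document_xml('test.docx'))`. Recent versions of Word may store each text box twice, once as DrawingML and once as a VML fallback, so the same paragraph can show up more than once in the result. With python-docx, the equivalent query would be `doc.element.body.xpath('.//w:txbxContent//w:p')`.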

opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimal .docx file containing the type of thing you're trying to locate, then use opc-diag to inspect the XML Word generates when you save the file:

$ opc browse test.docx document.xml

http://opc-diag.readthedocs.org/en/latest/index.html

scanny
  • Thanks a lot for the insight into this general approach. The current project does not justify digging deep into this particular part, so I found a way to put everything in a floating table instead of a textbox. Btw: great work with the docx project. Many thanks and please keep this work going. – Stefan Apr 28 '16 at 06:10
  • This could be achieved by adding a text frame (framePr) property to a paragraph: http://officeopenxml.com/WPparagraph-textFrames.php – William Payne Jul 16 '16 at 15:33