I have a task where I need to put placeholders in my .docx files and automatically replace them with information that I have. I tried ${VARNAME} as the placeholder syntax, but in the document.xml for that .docx file I see $, {, VARNAME and } broken up into 4 different character runs. On what basis does Word choose to do this? Is there a way to prevent it?

For replacing placeholders and manipulating .docx files I am using docx4j, and I am extracting the w:t nodes via XPath. Recently I tried using only $VARNAME as the placeholder syntax, and this was not broken up. Can I consider it a foolproof naming convention for placeholders? If not, can you suggest how I can tackle this situation? Would introducing custom tags in the .docx help? Any advice appreciated.

Aditya Bahuguna
  • This question has been asked before. Word never guarantees a single run. It will split a run for spelling, grammar, editing (rsid). You can turn some of those things off. Or you can tidy the document before processing; see https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/model/datastorage/migration/VariablePrepare.java – JasonPlutext Mar 01 '16 at 06:42
  • Made my day!! :) Thanks – Aditya Bahuguna Mar 01 '16 at 07:08

1 Answer

You can never assume that Word will not break up a character run, and there is no guaranteed way to prevent it. You either need to change your approach for extracting the information, by not relying on everything being in a single <w:t> tag, or you need to use a different kind of "target".
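To illustrate the first option, here is a minimal sketch that does not rely on a placeholder sitting in a single <w:t>. It uses only the JDK's DOM API on the raw document.xml markup rather than docx4j, and the class and method names (PlaceholderMerge, replaceAll) are made up for this example. It joins the text of all <w:t> elements in a paragraph, replaces ${NAME} in the joined string, writes the result into the first <w:t> and empties the others. The assumption to be aware of: the whole paragraph then takes the formatting of its first run, and nested paragraphs (text boxes, for instance) are not handled specially. The VariablePrepare class linked in the comments is the docx4j way of tidying runs.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or Word.
class PlaceholderMerge {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses an XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Replaces ${NAME} placeholders per paragraph, even when split across runs. */
    static void replaceAll(Document doc, Map<String, String> values) {
        NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W, "t");
            if (texts.getLength() == 0) continue;

            // Join the text of every run so a split placeholder becomes visible.
            StringBuilder joined = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                joined.append(texts.item(j).getTextContent());
            }
            String result = joined.toString();
            for (Map.Entry<String, String> e : values.entrySet()) {
                result = result.replace("${" + e.getKey() + "}", e.getValue());
            }
            if (result.equals(joined.toString())) continue; // no placeholder here

            // Put everything in the first <w:t>, keep its whitespace, blank the rest.
            Element first = (Element) texts.item(0);
            first.setTextContent(result);
            first.setAttributeNS(XMLConstants.XML_NS_URI, "xml:space", "preserve");
            for (int j = 1; j < texts.getLength(); j++) {
                texts.item(j).setTextContent("");
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
                + "<w:r><w:t>Hello $</w:t></w:r><w:r><w:t>{</w:t></w:r>"
                + "<w:r><w:t>VARNAME</w:t></w:r><w:r><w:t>}</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        Document doc = parse(xml);
        replaceAll(doc, Map.of("VARNAME", "World"));
        // Prints: Hello World
        System.out.println(doc.getElementsByTagNameNS(W, "t").item(0).getTextContent());
    }
}
```

Paragraphs without a placeholder are left untouched, so their runs and formatting stay as Word wrote them.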

Word does not support "custom tags", so that's not an option.

More reliable is to use a content control (the w:sdt element). The Word Open XML looks something like this:

<w:sdt>
  <w:sdtPr>
    <w:alias w:val="test"/><w:tag w:val="test"/><w:id w:val="803656476"/>
    <w:placeholder>
      <w:docPart w:val="B4C191A9BCFE488E807F3919BC721619"/>
    </w:placeholder>
    <w:text/>
  </w:sdtPr>
  <w:sdtContent>
    <w:p>
      <w:r>
        <w:t>Content to be changed by code.</w:t>
      </w:r>
    </w:p>
  </w:sdtContent>
</w:sdt>

The VARNAME would be either the w:alias or the w:tag (your choice). These correspond to the Title and Tag properties, respectively, in the Word UI and object model. There's no way these are going to get broken up.

From there, you get the <w:t> descendant of the <w:sdtContent> element.
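As a rough sketch of that lookup, the following uses the JDK's XPath API directly on the document.xml markup. It is not docx4j code, and the names SdtFill and fill are invented for the example. It selects every <w:sdt> whose <w:tag> has the given value, then writes the new text into the first <w:t> under <w:sdtContent> and empties any further ones. It assumes plain-text controls like the one above; content controls nested inside the matched one are not treated separately.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or Word.
class SdtFill {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Parses an XML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Sets the text of each content control tagged with tag; returns how many were updated. */
    static int fill(Document doc, String tag, String value) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Map the "w" prefix used in the expressions below to the WordprocessingML namespace.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "w".equals(prefix) ? W : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return W.equals(uri) ? "w" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });
        // Pass the tag as an XPath variable so quotes in it cannot break the expression.
        xpath.setXPathVariableResolver(name -> "tag".equals(name.getLocalPart()) ? tag : null);
        try {
            NodeList controls = (NodeList) xpath.evaluate(
                    "//w:sdt[w:sdtPr/w:tag/@w:val = $tag]", doc, XPathConstants.NODESET);
            int updated = 0;
            for (int i = 0; i < controls.getLength(); i++) {
                NodeList texts = (NodeList) xpath.evaluate(
                        "w:sdtContent//w:t", controls.item(i), XPathConstants.NODESET);
                if (texts.getLength() == 0) continue; // empty control, nothing to overwrite
                texts.item(0).setTextContent(value);
                for (int j = 1; j < texts.getLength(); j++) {
                    texts.item(j).setTextContent("");
                }
                updated++;
            }
            return updated;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body><w:sdt>"
                + "<w:sdtPr><w:alias w:val=\"test\"/><w:tag w:val=\"test\"/></w:sdtPr>"
                + "<w:sdtContent><w:p><w:r><w:t>Content to be changed by code.</w:t></w:r></w:p></w:sdtContent>"
                + "</w:sdt></w:body></w:document>";
        Document doc = parse(xml);
        System.out.println(fill(doc, "test", "New value")); // prints 1
    }
}
```

Matching on w:alias instead only requires changing w:tag to w:alias in the first XPath expression.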

If you wish, the content control can be mapped to a node in a Custom XML Part stored in the document. (Unlike custom tags in the text, Word does support adding XML files to the document's Zip package.) In that case, your code can address the Custom XML file rather than document.xml in order to read and write content. The changes will be reflected in the content controls linked to the nodes.

Cindy Meister
  • Hey, can you tell me a bit more about how to link a node to a content control? Is your answer in reference to docx4j? A link or walkthrough would be so helpful! – Aditya Bahuguna Mar 01 '16 at 11:57