I'm trying to write a program that reads a docx file and checks whether some of the text is colored. For instance, imagine if all the words bolded in this sentence were actually written in some arbitrary color. I want my program to recognize that the words "words bolded in this sentence were actually written in some arbitrary color" are colored.

Then, after recognizing the coloration, I want to be able to edit the recognized text based on its color. For instance, if the bolded text above were red, I want to add "<Red>" tags around it while keeping the rest of the sentence, which isn't colored, intact.

I was originally using ZipInputStream and ZipEntry to get "word/document.xml", and I had planned on pulling the text and colors from there, but I feel like that would get too confusing after a while. I also tried using Apache POI, but I don't think it's able to recognize colors. Docx4j looks promising, though. Any thoughts, suggestions, or sample code to get me started?

nathanchere
user2858182

1 Answer

Font color is a run property:

  <w:r>
    <w:rPr>
      <w:color w:val="FF0000"/>
    </w:rPr>
    <w:t>red</w:t>
  </w:r>
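
To make the markup above concrete, here is a minimal sketch that reads the direct color of each run and wraps colored text in tags. It deliberately uses only the JDK's DOM parser rather than docx4j, so it is not docx4j's API. The class name `RunColorTagger`, the helper names, the `FF0000` to `Red` mapping, and the closing-tag form `</Red>` are all assumptions made for illustration. It only sees colors set directly on a run.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j: tags runs that carry a direct w:color.
class RunColorTagger {
    // Namespace bound to the w: prefix in word/document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the first direct child of parent named w:localName, or null.
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // Returns w:rPr/w:color/@w:val for the run, or null if absent or "auto".
    static String directColor(Element run) {
        Element rPr = child(run, "rPr");
        Element color = (rPr == null) ? null : child(rPr, "color");
        if (color == null) return null;
        String val = color.getAttributeNS(W_NS, "val");
        return (val.isEmpty() || val.equals("auto")) ? null : val;
    }

    // Assumed mapping from hex value to tag name; unknown colors keep the hex value.
    static String tagName(String hex) {
        return hex.equalsIgnoreCase("FF0000") ? "Red" : hex;
    }

    // Concatenates the text of all runs, wrapping directly colored runs in tags.
    static String tagColoredRuns(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);
                StringBuilder text = new StringBuilder();
                NodeList ts = run.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < ts.getLength(); j++) {
                    text.append(ts.item(j).getTextContent());
                }
                String color = directColor(run);
                if (color == null) {
                    out.append(text);
                } else {
                    String tag = tagName(color);
                    out.append('<').append(tag).append('>').append(text)
                       .append("</").append(tag).append('>');
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse run XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r><w:t xml:space=\"preserve\">plain </w:t></w:r>"
                + "<w:r><w:rPr><w:color w:val=\"FF0000\"/></w:rPr><w:t>red</w:t></w:r>"
                + "<w:r><w:t xml:space=\"preserve\"> tail</w:t></w:r>"
                + "</w:p>";
        System.out.println(tagColoredRuns(xml)); // prints: plain <Red>red</Red> tail
    }
}
```

Uncolored runs pass through unchanged, so the rest of the sentence stays intact.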

docx4j provides three ways to find and work with that property:

  • via XPath
  • via TraversalUtil
  • via XSLT

I'd recommend TraversalUtil, since XPath is dependent on JAXB's support for it, which isn't always robust (at least in the Sun/Oracle reference implementation).

See the finders package for examples of using this.

But beyond this, the challenge you face is that the color property could be specified via a style (or even as a document default). If you want to take this into account, you need to look at the effective run properties (which is what docx4j's PDF output does).
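
To illustrate what "effective" means here, the sketch below resolves a run's color by checking direct formatting first, then the run (character) style, then the paragraph style, and finally the document default, following each style's based-on chain. It is a simplified model that uses plain Java collections, not docx4j's own resolution code. The `EffectiveColor` class, its `Style` type, and the style ids are hypothetical, and table styles and numbering are ignored.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of color inheritance; not docx4j's API.
class EffectiveColor {
    // A style with an optional color and an optional parent (w:basedOn) style id.
    static final class Style {
        final String color;   // color set on this style, or null
        final String basedOn; // id of the parent style, or null
        Style(String color, String basedOn) {
            this.color = color;
            this.basedOn = basedOn;
        }
    }

    // Walks the basedOn chain and returns the first color found, or null.
    static String styleColor(Map<String, Style> styles, String styleId) {
        Set<String> seen = new HashSet<>(); // guards against cyclic basedOn links
        while (styleId != null && seen.add(styleId)) {
            Style s = styles.get(styleId);
            if (s == null) return null;
            if (s.color != null) return s.color;
            styleId = s.basedOn;
        }
        return null;
    }

    // Precedence: direct run color, run style, paragraph style, document default.
    static String resolve(String direct, String runStyleId, String paraStyleId,
                          Map<String, Style> styles, String docDefault) {
        if (direct != null) return direct;
        String c = styleColor(styles, runStyleId);
        if (c != null) return c;
        c = styleColor(styles, paraStyleId);
        if (c != null) return c;
        return docDefault;
    }

    public static void main(String[] args) {
        Map<String, Style> styles = new HashMap<>();
        styles.put("Normal", new Style(null, null));
        styles.put("Warning", new Style("FF0000", "Normal"));
        styles.put("WarningNote", new Style(null, "Warning"));

        // No direct color: the paragraph style inherits red from "Warning".
        System.out.println(resolve(null, null, "WarningNote", styles, "000000")); // prints: FF0000
        // Direct formatting on the run wins over any style.
        System.out.println(resolve("00FF00", null, "WarningNote", styles, "000000")); // prints: 00FF00
    }
}
```

A run with no `w:color` of its own can therefore still render in color, which is why checking only the direct run property can miss colored text.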

JasonPlutext