Recognizing colors in text from a docx

Question

I'm trying to write a program that reads a docx file and checks whether some of the text is colored. For instance, imagine if all the words bolded in this sentence were actually written in some arbitrary color. I want my program to recognize that the words "words bolded in this sentence were actually written in some arbitrary color" are colored.

Then, after recognizing the coloration, I want to be able to edit the recognized text based on the color. For instance, if the bolded text above were red, I want to add "<Red>" tags around the text, while still keeping intact the rest of the sentence that isn't colored.

I was originally using ZipInputStream and ZipEntry to get "word/document.xml", and I had planned on pulling the text and colors from there, but I feel like that would get too confusing after a while. I also tried using Apache POI, but I don't think it's able to recognize colors. Docx4j looks promising, though. Any thoughts, suggestions, or sample code to get me started?

score 2 · Answer 1 · answered Oct 08 '13 at 23:04

Font color is a run property:

  <w:r>
    <w:rPr>
      <w:color w:val="FF0000"/>
    </w:rPr>
    <w:t>red</w:t>
  </w:r>
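
To make the structure above concrete, here is a minimal sketch that uses only the JDK's DOM parser (not docx4j) on a fragment of document.xml. It wraps each run that has a direct w:color in a tag named after the hex value and leaves other runs untouched. The class and method names (ColorTagger, tagColoredRuns) are made up for illustration, and the tag format is an assumption; mapping "FF0000" to a name like "Red" is left out. It only sees color set directly on the run, not color coming from styles.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j or any other library.
final class ColorTagger {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of all runs, with directly colored runs wrapped in <RRGGBB>...</RRGGBB>. */
    static String tagColoredRuns(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match the w: namespace reliably
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        StringBuilder out = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(W, "r"); // all w:r elements, in document order
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            String text = textOf(run);
            String color = directColor(run);
            if (color == null) {
                out.append(text);
            } else {
                out.append('<').append(color).append('>').append(text)
                   .append("</").append(color).append('>');
            }
        }
        return out.toString();
    }

    /** First direct child element in the w: namespace with the given local name, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** The w:val of w:rPr/w:color on this run, or null if absent or "auto". */
    static String directColor(Element run) {
        Element rPr = child(run, "rPr");
        Element color = rPr == null ? null : child(rPr, "color");
        if (color == null) return null;
        String val = color.getAttributeNS(W, "val");
        return (val.isEmpty() || val.equals("auto")) ? null : val;
    }

    /** Concatenated content of the run's w:t children. */
    static String textOf(Element run) {
        StringBuilder sb = new StringBuilder();
        for (Node n = run.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && "t".equals(n.getLocalName())) {
                sb.append(n.getTextContent());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<w:p xmlns:w=\"" + W + "\">"
                + "<w:r><w:t xml:space=\"preserve\">plain </w:t></w:r>"
                + "<w:r><w:rPr><w:color w:val=\"FF0000\"/></w:rPr><w:t>red</w:t></w:r>"
                + "</w:p>";
        System.out.println(tagColoredRuns(xml)); // prints: plain <FF0000>red</FF0000>
    }
}
```

Note that Word often splits one visually uniform phrase across several runs, so adjacent runs with the same color may need to be merged before tagging.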

docx4j provides three ways to work with that:

via XPath
via TraversalUtil
via XSLT

I'd recommend TraversalUtil, since XPath is dependent on JAXB's support for it, which isn't always robust (at least in the Sun/Oracle reference implementation).

See the finders package for examples of using this.

But beyond this, the challenge you face is that the color property could be specified via a style (or even as a document default). If you want to take this into account, you need to be looking at the effective run properties (which is what docx4j's PDF output does).
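
To illustrate what "effective" means here, the following is a simplified sketch of the fallback order: direct run formatting first, then the run (character) style and its basedOn chain, then the paragraph style and its basedOn chain, then the document default. It is not docx4j's implementation; the Style class and the resolve method are hypothetical, and the model ignores table styles, theme colors, and other details a real resolver has to handle.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not docx4j's API.
final class EffectiveColor {
    /** Stand-in for a style definition: its own color and its parent style, either may be null. */
    static final class Style {
        final String color;   // null when the style does not set a color
        final String basedOn; // id of the parent style, or null
        Style(String color, String basedOn) {
            this.color = color;
            this.basedOn = basedOn;
        }
    }

    /** Walks the basedOn chain from styleId and returns the first color found, or null. */
    static String fromStyle(Map<String, Style> styles, String styleId) {
        Set<String> seen = new HashSet<>(); // guards against a cyclic basedOn chain
        while (styleId != null && seen.add(styleId)) {
            Style s = styles.get(styleId);
            if (s == null) return null;
            if (s.color != null) return s.color;
            styleId = s.basedOn;
        }
        return null;
    }

    /** Direct formatting wins, then run style, then paragraph style, then the document default. */
    static String resolve(String directColor, String runStyleId, String paragraphStyleId,
                          Map<String, Style> styles, String docDefaultColor) {
        if (directColor != null) return directColor;
        String c = fromStyle(styles, runStyleId);
        if (c != null) return c;
        c = fromStyle(styles, paragraphStyleId);
        if (c != null) return c;
        return docDefaultColor;
    }

    public static void main(String[] args) {
        Map<String, Style> styles = new HashMap<>();
        styles.put("Normal", new Style(null, null));
        styles.put("Heading1", new Style("2E74B5", "Normal"));
        styles.put("Title", new Style(null, "Heading1")); // inherits Heading1's color

        System.out.println(resolve(null, null, "Title", styles, "000000"));     // prints: 2E74B5
        System.out.println(resolve("FF0000", null, "Title", styles, "000000")); // prints: FF0000
        System.out.println(resolve(null, null, "Normal", styles, "000000"));    // prints: 000000
    }
}
```

The point of the sketch is that a run with no w:color of its own can still render in color, so checking only the run's own rPr will miss text colored through a style.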
