2

I am converting Word documents (2003 and 2007) to HTML. I have managed to read the text, formatting, etc. from the Word document. However, the document contains some hidden text, such as 'Header Change History', which should not be displayed on the page. Is there any way to identify hidden text in a Word document?

Any help would be much appreciated.

Thilo
Albin Joseph
  • From what I see in the POI documentation, you can only read and manipulate the header; there is no option to retrieve the history. Happy to be proven wrong if someone can pinpoint a relevant reference. – peter_budo Aug 23 '11 at 09:47

2 Answers

3

I am not sure if this is a complete (or even accurate) solution, but for files in the DOCX format, it seems that you can check whether a character run is hidden with:

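import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTRPr;

// A run is treated as hidden when its run properties contain a <w:vanish/> element.
static boolean isHidden(XWPFRun cr) {
    CTRPr rPr = cr.getCTR().getRPr(); // null when the run has no run properties
    return rPr != null && rPr.getVanish() != null;
}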

Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.
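Since the check came from reading the XML, here is a rough sketch of the same idea done on the raw WordprocessingML with only the JDK's DOM parser, no POI involved. It assumes you have already pulled `word/document.xml` out of the .docx (which is a ZIP archive) into a string. The class and method names (`VanishSketch`, `isHidden`, `visibleText`) are made up for this example. It only looks at formatting set directly on the run, so text hidden through a style would not be caught, and it treats `w:val` values of `false`, `0` and `off` as "not hidden".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: filters hidden runs out of raw WordprocessingML.
class VanishSketch {
    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Small hand-written sample: one visible run, one hidden run,
    // and one run where vanish is explicitly switched off.
    static final String SAMPLE =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:t>Visible </w:t></w:r>"
            + "<w:r><w:rPr><w:vanish/></w:rPr><w:t>Header Change History</w:t></w:r>"
            + "<w:r><w:rPr><w:vanish w:val=\"false\"/></w:rPr><w:t>text</w:t></w:r>"
            + "</w:p></w:body></w:document>";

    // True if the run has <w:rPr><w:vanish/></w:rPr> that is not turned off.
    static boolean isHidden(Element run) {
        NodeList props = run.getElementsByTagNameNS(W_NS, "rPr");
        if (props.getLength() == 0) {
            return false; // no run properties at all
        }
        NodeList vanish =
                ((Element) props.item(0)).getElementsByTagNameNS(W_NS, "vanish");
        if (vanish.getLength() == 0) {
            return false; // run properties present, but no vanish element
        }
        // getAttributeNS returns "" when the attribute is absent, which means "on".
        String val = ((Element) vanish.item(0)).getAttributeNS(W_NS, "val");
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }

    // Concatenates the text of every run that is not directly marked as hidden.
    static String visibleText(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < runs.getLength(); i++) {
                Element run = (Element) runs.item(i);
                if (isHidden(run)) {
                    continue; // skip hidden text
                }
                NodeList texts = run.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    sb.append(texts.item(j).getTextContent());
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(visibleText(SAMPLE)); // prints "Visible text"
    }
}
```

For a real file you would read the `word/document.xml` entry with `java.util.zip.ZipFile` and pass its contents to `visibleText`. In practice the POI check is shorter; this is mainly useful for seeing what the element looks like in the file.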

Thilo
2

The following code snippet helps identify whether text is hidden in the older binary .doc format, using HWPF:

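import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    POIFSFileSystem fs = null;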

    boolean isHidden = false;
    try {
        fs = new POIFSFileSystem(new FileInputStream(filesname));
        HWPFDocument doc = new HWPFDocument(fs);
        WordExtractor we = new WordExtractor(doc);

        String[] paragraphs = we.getParagraphText();

        System.out.println("Word Document has " + paragraphs.length
                + " paragraphs");
        Range range = doc.getRange();

        for (int k = 0; k < range.numParagraphs(); k++) {

            org.apache.poi.hwpf.usermodel.Paragraph paragraph = range
                    .getParagraph(k);
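            // Strings are immutable: trim() and replaceAll() return new strings,
            // so the cleaned-up text has to be stored to have any effect.
            String text = paragraph.text().trim().replaceAll("\\cM?\r?\n", "");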

            for (int j = 0; j < paragraph.numCharacterRuns(); j++) {

                org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph
                        .getCharacterRun(j);

                if (cr.isVanished()) {
                    // it is hidden
                    System.out.println("text is hidden ");
                    isHidden = true;
                    break;
                }

            }