
HYPERLINK "target"label

How can I extract hyperlinks from an HWPF document? I can get paragraphs from the .doc file and extract the correct styling where necessary (bold, italic, etc.), but how would I identify and extract hyperlinks from a paragraph?

Diyarbakir

1 Answer


The .doc format doesn't store hyperlinks in the simplest of ways, as you've noticed...

A hyperlink will be a single CharacterRun, with special markers on it. Its text has the form HYPERLINK "target"label, so once you have detected it, just split the text on the quotes to get the target and the label.

There's a good example of doing this in Apache Tika; look at the handleSpecialCharacterRuns method of WordExtractor to see how it's done.
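As a rough illustration of the "split on the quotes" step, here is a minimal sketch that works on plain text only. It assumes you have already pulled the run text out of the document (and joined adjacent runs if the field text is spread over more than one). The class name HyperlinkFieldParser and its parse method are made up for this example; they are not part of POI or Tika. It also strips the Word field control characters (0x13, 0x14, 0x15) in case they are present in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a POI or Tika class.
public class HyperlinkFieldParser {

    // HYPERLINK, whitespace, a quoted target, then whatever follows as the label.
    private static final Pattern FIELD =
            Pattern.compile("HYPERLINK\\s+\"([^\"]*)\"\\s*(.*)", Pattern.DOTALL);

    /**
     * Returns a two-element array {target, label}, or null if the text
     * does not contain a HYPERLINK field.
     */
    public static String[] parse(String text) {
        // Drop the field begin / separator / end control characters if present.
        String cleaned = text.replaceAll("[\\x13\\x14\\x15]", "");
        Matcher m = FIELD.matcher(cleaned);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] link = parse("HYPERLINK \"http://example.com\"Example site");
        System.out.println(link[0] + " -> " + link[1]);
    }
}
```

With POI you would feed this the text() of the CharacterRun (or of several consecutive runs concatenated) rather than a literal string.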

Gagravarr
  • The hyperlink is not a single CharacterRun in my case. I expected it to be, but it isn't. While debugging I saw that one hyperlink was split into two CharacterRuns instead of one: the first run gave me HYPERLINK "target" and the next run gave me the "label". I will investigate this further. Thanks for the link. – Diyarbakir Dec 01 '11 at 11:16
  • That might be a POI bug - make sure you're using POI 3.8 beta 4 (or a newer nightly build) – Gagravarr Dec 01 '11 at 12:17