I'm using Apache Tika to extract raw text from various document formats, including Microsoft Office files.
When I extract text from Word documents that contain hyperlinks, only the visible text is extracted; the link target (the URL) is lost.
Is there a way to configure the parser so that the underlying link target is extracted as well? This is my current code:
ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
Metadata metadata = new Metadata();
try (TikaInputStream input = TikaInputStream.get(new File(fileName))) {
    BodyContentHandler handler = new BodyContentHandler();
    parser.parse(input, handler, metadata, context);
    String rawText = handler.toString();
    // try-with-resources closes the stream; no explicit close() is needed
}
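
One direction that may be worth trying, sketched under assumptions: Tika parsers report document content as XHTML SAX events, and BodyContentHandler keeps only the character data, so any `a` elements and their `href` attributes are dropped. Assuming the Office parsers emit hyperlinks as `<a href="...">` elements, a custom SAX handler can keep both the text and the link targets. The class `LinkAwareTextHandler` and its methods below are hypothetical names made up for this illustration, not part of Tika; the `parseXhtml` helper uses the JDK's SAX parser only so that the handler can be tried without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not a Tika class): collects all character data and
 * appends each link target in angle brackets right after its anchor text.
 */
class LinkAwareTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> links = new ArrayList<>();
    private String currentHref; // href of the anchor currently open, if any

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isAnchor(localName, qName)) {
            // Look up by local name first, then by qualified name, so the
            // handler works with namespace-aware and plain SAX sources.
            String href = atts.getValue("", "href");
            currentHref = (href != null) ? href : atts.getValue("href");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isAnchor(localName, qName) && currentHref != null) {
            links.add(currentHref);
            text.append(" <").append(currentHref).append('>');
            currentHref = null;
        }
    }

    private static boolean isAnchor(String localName, String qName) {
        return "a".equals(localName) || "a".equals(qName);
    }

    String getText() {
        return text.toString();
    }

    List<String> getLinks() {
        return links;
    }

    /** Stand-alone helper: feeds an XHTML string to the handler via the JDK SAX parser. */
    static LinkAwareTextHandler parseXhtml(String xhtml) throws Exception {
        LinkAwareTextHandler handler = new LinkAwareTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```

Since DefaultHandler implements org.xml.sax.ContentHandler, an instance of this class could be passed to `parser.parse(input, handler, metadata, context)` in place of the BodyContentHandler, after which `getText()` and `getLinks()` would hold the results. Unlike BodyContentHandler, this sketch does not restrict itself to the body element, so text from the head (such as the title) would also be included. Tika also ships `org.apache.tika.sax.LinkContentHandler` and `TeeContentHandler`, which might cover the same need without a custom class.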