I'm using Apache Tika to extract raw text from various document formats, including Office documents.

When I extract text from Word documents that contain hyperlinks, only the link text is extracted; the link target (the URL) is lost.

Is there a way to configure the parser so that the underlying link is also extracted?

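    import java.io.File;
    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    // Auto-detect the document type; registering the parser in the
    // context lets embedded documents be parsed as well
    ParseContext context = new ParseContext();
    Detector detector = new DefaultDetector();
    Parser parser = new AutoDetectParser(detector);
    context.set(Parser.class, parser);
    Metadata metadata = new Metadata();

    // try-with-resources closes the stream, so no explicit close() is needed
    try (TikaInputStream input = TikaInputStream.get(new File(fileName))) {
        // BodyContentHandler collects only the plain text of the body
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(input, handler, metadata, context);

        String rawText = handler.toString();
    }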

Matthias

    Ask Tika to give you the HTML version of the file, rather than the Plain Text version as you are now? – Gagravarr Nov 11 '15 at 14:16
    This is a possible workaround but additional post processing is required to handle / remove HTML tags. – Matthias Nov 11 '15 at 19:42
    You could ask Tika for it twice, once as HTML which you grab the links from, and once as Plain Text which you use? Otherwise, yes, if you want links you'll need to look through the HTML for the `a` tags – Gagravarr Nov 11 '15 at 22:51

1 Answer


I use tika-app from bash to extract hyperlinks from Office documents. The `--html` option makes Tika output the content of a file as HTML. I then filter that HTML with sed and grep down to the values of the `href` attributes, which gives one link target per line.

    java -jar /root/tika-app-1.20.jar --html TEST.docx 2>/dev/null | sed 's/href/\nhref/g' | grep '^href' | sed 's/href="//' | sed 's/".*//'

I know that OP is not using tika-app, but the general approach can be applied using Tika from Java too.
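As a rough sketch of that idea in Java, the filtering step of the pipeline can be done with a regular expression. This assumes the HTML is already available as a string, for example by passing Tika's `ToXMLContentHandler` to `parser.parse` instead of `BodyContentHandler` and calling `toString()` on it. The class `HrefExtractor` and its method `extractHrefs` are hypothetical names made up for this example; they are not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): collects href values from HTML text.
class HrefExtractor {

    // Matches href="..." and captures the text between the double quotes,
    // which is what the sed/grep pipeline keeps.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for the HTML that Tika would produce for a document.
        String html = "<p>See <a href=\"https://tika.apache.org/\">Tika</a> or "
                + "<a href=\"mailto:dev@example.org\">mail us</a>.</p>";
        for (String link : extractHrefs(html)) {
            System.out.println(link);
        }
    }
}
```

Like the shell version, this is plain text matching rather than HTML parsing, so the captured values stay HTML-escaped (for example `&amp;` inside a URL). Tika also ships `org.apache.tika.sax.LinkContentHandler`, which collects links during parsing and may be a cleaner fit if post-processing HTML is not desirable.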

Liam