Apache POI extract hyperlinks from word document

Question

Does anyone know how to extract links from Word documents using Apache POI? Or, even better, from a single paragraph?

Old style .doc files, or new style .docx ones? (It's slightly different) — Gagravarr, Oct 24 '11 at 13:53

score 5 · Answer 1 · edited Dec 15 '20 at 03:57

Word 2007 and newer (.docx files, handled by XWPF):

// Links extractor
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.*;

StringBuilder text = new StringBuilder();
// try-with-resources closes both the stream and the document
try (FileInputStream fis = new FileInputStream("YOUR_DOCX_FULL_PATH_HERE");
     XWPFDocument document = new XWPFDocument(fis)) {

    // Walk every body paragraph
    for (XWPFParagraph paragraph : document.getParagraphs()) {

        // Only hyperlink runs are of interest here
        for (XWPFRun run : paragraph.getRuns()) {
            if (run instanceof XWPFHyperlinkRun) {
                // Link text as it appears in the document
                text.append(run.toString());
                // Resolve the run's relationship id to the target URL
                XWPFHyperlink link = ((XWPFHyperlinkRun) run).getHyperlink(document);
                if (link != null) {
                    text.append(" <").append(link.getURL()).append(">");
                }
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}

It's generally best not to open an XWPF/XSSF/XSLF instance from an InputStream if you have the file, as that forces the whole file to be buffered into memory. It's much better to open it directly via the File instead — Gagravarr, May 02 '12 at 13:45
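
Following up on that comment, here is a minimal sketch of the same extraction opened from a File rather than an InputStream. It assumes a reasonably recent POI release where `OPCPackage.open(File, PackageAccess)` is available. The class name `DocxLinkExtractor` and the method `extractLinks` are made up for illustration and are not part of POI. It only looks at body paragraphs, so links inside tables, headers or footers are not covered.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFHyperlink;
import org.apache.poi.xwpf.usermodel.XWPFHyperlinkRun;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical helper class, not part of Apache POI
public class DocxLinkExtractor {

    // Returns one "link text <url>" entry per hyperlink run in the body paragraphs
    public static List<String> extractLinks(File docx)
            throws IOException, InvalidFormatException {
        List<String> links = new ArrayList<>();

        // Opening from the File (read-only) avoids buffering the whole stream in memory;
        // closing the document also releases the underlying package
        try (XWPFDocument document =
                 new XWPFDocument(OPCPackage.open(docx, PackageAccess.READ))) {
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                for (XWPFRun run : paragraph.getRuns()) {
                    if (run instanceof XWPFHyperlinkRun) {
                        // Look up the link target via the run's relationship id
                        XWPFHyperlink link =
                            ((XWPFHyperlinkRun) run).getHyperlink(document);
                        if (link != null) {
                            links.add(run.text() + " <" + link.getURL() + ">");
                        }
                    }
                }
            }
        }
        return links;
    }
}
```

To get the links of a single paragraph, as the question asks, the inner loop over `paragraph.getRuns()` should be enough on its own, as long as you still have the owning `XWPFDocument` to pass to `getHyperlink`.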
