I'm crawling some HTML files with crawler4j and I want to replace all links in those pages with custom links. Currently I can get the source HTML and a list of all outgoing links with this code:
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
List<WebURL> links = htmlParseData.getOutgoingUrls();
However, a simple foreach loop with search and replace won't get me what I want. The problem is that WebURL.getURL() always returns the absolute URL, while the links in the HTML source are sometimes relative and sometimes absolute, so the returned string often doesn't appear verbatim in the page.
I want to handle all links (images, URLs, JavaScript files, etc.). For instance, I want to replace images/img.gif with view.php?url=http://www.domain.com/images/img.gif.
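To make the target mapping concrete, here is a minimal sketch of the rewrite for a single link value, using only java.net.URI from the standard library. The class name LinkRewriter, the method rewrite, and the page URL http://www.domain.com/index.html are made up for illustration. It only covers turning one href/src value into the custom link; it does not solve the part I'm actually stuck on, which is finding and replacing every such value in the HTML.

```java
import java.net.URI;

final class LinkRewriter {
    // Prefix taken from the example above; hypothetical name for illustration.
    private static final String PROXY_PREFIX = "view.php?url=";

    // Resolves a link as it appears in the page (relative or absolute)
    // against the page's own URL, then prepends the proxy prefix.
    // The URL is appended unencoded, matching the example above.
    // Note: URI.create throws IllegalArgumentException for values that are
    // not valid URIs (e.g. containing spaces).
    static String rewrite(String pageUrl, String link) {
        URI absolute = URI.create(pageUrl).resolve(link.trim());
        return PROXY_PREFIX + absolute;
    }

    public static void main(String[] args) {
        String page = "http://www.domain.com/index.html"; // assumed page URL
        // Relative and absolute forms of the same link give the same result:
        // view.php?url=http://www.domain.com/images/img.gif
        System.out.println(rewrite(page, "images/img.gif"));
        System.out.println(rewrite(page, "http://www.domain.com/images/img.gif"));
    }
}
```

Both calls give the same output, which is the behaviour I need regardless of how the link was written in the source.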
The only solution that comes to mind is a somewhat complicated regex, but I'm afraid I'll miss some rare cases. Has this been done already? Is there a library or tool to achieve this?