I am currently using JTidy to parse an HTML document and fetch a collection of all anchor tags in the given HTML document. I then extract the value of each tag's href attribute to come up with a collection of links on the page.
Unfortunately, these links can be expressed in a few different ways: some absolute (http://www.example.com/page.html), some relative (/page.html, page.html, or ../page.html). Some can even be just anchors (#paragraphA). When I visit the page in a browser, it knows automatically how to handle each of these href values when I click a link. However, if I want to follow one of the links retrieved from JTidy programmatically using an HTTPClient, I first need to provide a valid absolute URL (so e.g. I would first need to transform /page.html, page.html, and http://www.example.com/page.html into http://www.example.com/page.html).
Is there some built-in functionality, whether in JTidy or elsewhere, that can achieve this for me? Or will I need to create my own rules to transform these different URLs into an absolute URL?
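To make the desired behaviour concrete, here is a minimal sketch of the transformation I have in mind. It assumes that java.net.URI.resolve from the standard library is suitable for this, which I have not confirmed for every kind of href. The class name LinkResolver, the helper toAbsolute, and the page URL http://www.example.com/dir/index.html are all hypothetical, made up for illustration.

```java
import java.net.URI;
import java.util.List;

// Hypothetical helper, not part of JTidy: resolves an href against the page URL.
class LinkResolver {

    // pageUrl is the absolute URL of the document the href was found in.
    // Throws IllegalArgumentException if either string is not a valid URI.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed URL of the page that was parsed (illustrative only).
        String pageUrl = "http://www.example.com/dir/index.html";

        List<String> hrefs = List.of(
                "http://www.example.com/page.html", // already absolute
                "/page.html",                       // relative to the host root
                "page.html",                        // relative to the current directory
                "../page.html",                     // relative to the parent directory
                "#paragraphA");                     // fragment only

        for (String href : hrefs) {
            System.out.println(href + " -> " + toAbsolute(pageUrl, href));
        }
    }
}
```

Note that in this sketch page.html resolves relative to the directory of the page URL, so the result depends on which page the link was found on. Real-world href values may also contain spaces or other characters that URI.create rejects, so they would probably need cleaning up first.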