0

As the title says, how do I return a list of the URLs found in (a href) references and write it to a text file? The code below returns the HTML from a website.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Main {
    public static void main(String[] args)  {
        try {
            URL my_url = new URL("http://www.placeofjo.blogspot.com/");
            BufferedReader br = new BufferedReader(
               new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            while(null != (strTemp = br.readLine())){
                System.out.println(strTemp);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Jonathon Faust
Wai Loon II

3 Answers

3

It sounds like you want an HTML parsing library such as HtmlUnit, rather than the hassle of parsing the HTML yourself. The HtmlUnit code is as simple as:

final WebClient webClient = new WebClient();
webClient.setJavaScriptEnabled(false);
final HtmlPage page = webClient.getPage("http://www.placeofjo.blogspot.com/");

//  Then iterate through
for (DomElement element : page.getElementsByTagName("a")){
    String link = ((HtmlAnchor)element).getHrefAttribute();
    System.out.println(link);
}

Gives output of:

http://www.twitter.com/jozefinfin/
http://www.facebook.com/jozefinfin/
http://placeofjo.blogspot.com/2008_08_01_archive.html
... etc etc
http://placeofjo.blogspot.com/2011_02_01_archive.html
http://endlessdance.blogspot.com
http://blogskins.com/me/aaaaaa
http://weheartit.com
Matthew Gilliard
  • Hi, thanks for your reply and attention. I am currently testing your code, but I got this error (HTML anchor cannot be resolved to a type). I would appreciate it if you could guide me ^^ – Wai Loon II Feb 28 '11 at 15:43
  • I tested the code briefly here. `HtmlAnchor` is in the latest htmlunit release (2.8) which you can download from http://sourceforge.net/projects/htmlunit/files/htmlunit/2.8/ – Matthew Gilliard Feb 28 '11 at 15:54
  • OK I modified my answer - just had to turn javascript off. – Matthew Gilliard Feb 28 '11 at 16:52
1

You might want to try parsing the HTML with jsoup and collect all the anchor tags from the page.
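
The same "collect every anchor tag" idea can be sketched without adding a dependency. The example below is not jsoup: it swaps in the HTML parser that ships with the JDK (`javax.swing.text.html`), which is older and less forgiving than jsoup. The class name `AnchorCollector`, the method `collectHrefs`, and the sample HTML are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values with the JDK's built-in HTML parser.
class AnchorCollector {

    static List<String> collectHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();

        // The parser calls back once for every start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {          // skip anchors without an href
                        hrefs.add(href.toString());
                    }
                }
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample input is an assumption; in practice pass in the fetched page.
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a class=\"x\" href='http://example.com/b'>B</a>"
                + "</body></html>";
        for (String href : collectHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

With jsoup itself the code would likely be shorter; this version only shows that the parsing approach does not have to mean hand-written string matching.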

Jeremy
-1

Edit (2)

If you're looking for a robust solution (or might need to extend to parsing more HTML), then check out one of the other answers here. If you just want a quick-and-dirty, one-time solution, you might consider regex.


If I understand you correctly, you want to extract the href values for all <a> tags in the HTML you're fetching.

You can use regular expressions. Something like

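// Note the doubled backslash: "\s" on its own is not a valid Java string escape.
// The reluctant quantifiers (.*?) keep each match as short as possible.
String regex = "<a\\s.*?href=['\"](.*?)['\"].*?>";
Pattern p = Pattern.compile(regex);  // java.util.regex.Pattern
Matcher m = p.matcher(text);         // text holds the fetched HTML
while (m.find())
{
    // group(1) is the captured href value; group() would return the whole tag
    String urlStr = m.group(1);
}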
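
To connect this back to the question (getting the URLs into a text file), here is a minimal, self-contained sketch of the regex approach. It is quick-and-dirty by design and will miss unquoted or oddly formatted attributes. The class name `HrefRegexDemo`, the sample HTML, and the output file name `links.txt` are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls quoted href values out of <a> tags with a regex.
class HrefRegexDemo {

    // Group 1 is the opening quote, group 2 the href value up to the same quote.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s(?:[^>]*?\\s)?href\\s*=\\s*(['\"])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Sample input is an assumption; in practice use the HTML read from the URL.
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<a class='x' href='http://example.com/b'>B</a></p>";

        List<String> urls = extractHrefs(html);

        // Write one URL per line to a text file in the working directory.
        Path out = Paths.get("links.txt");
        Files.write(out, urls, StandardCharsets.UTF_8);
    }
}
```

In the question's program, the lines read from the `BufferedReader` could be appended to a `StringBuilder` and the result passed to `extractHrefs`.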

Edit (1)

Corrected the regex: we want reluctant quantifiers, otherwise we'll end up matching everything!

no.good.at.coding
  • 1
    You cannot parse HTML with regex. See the top answer at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags for a good explanation. – Matthew Gilliard Feb 28 '11 at 15:17
  • 1
    I can see this leading towards http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – posdef Feb 28 '11 at 15:19
  • Well, if all you are looking for is `href="something"` then regex would be fine. – Jeremy Feb 28 '11 at 15:19
  • The questioner is looking for all `href="something"` inside well-formed anchor elements, taking into account possible non-standard spacing and other edge cases. Not as simple as it first seems. For example, the current version posted would fail on – Matthew Gilliard Feb 28 '11 at 15:23
  • @mjg123 - You're right, if your application calls for a full-fledged HTML scraper/parser then regex is too brittle to work well. However, if you have a specialized case or have limited, known HTML to work with, then an HTML parser is probably overkill. I think both approaches have their place, with regexes being for simple, quick, one-time jobs. – no.good.at.coding Feb 28 '11 at 15:24