I'm writing some code to load and parse HTML docs from the web.

I'm using JDOM like so:

SAXBuilder parser = new SAXBuilder();
Document document = parser.build("http://www.google.com");
Element rootNode = document.getRootElement();
/* and so on ...*/

It works fine like that. However, when I change the URL to some other web site, "http://www.kijiji.com" for example, the parser.build(...) line hangs.

Any idea why it hangs? I'm wondering if it might be because kijiji knows I'm not a "real" web browser -- perhaps I have to spoof my HTTP request so it looks like it's coming from IE or something like that?

Any ideas are useful, thanks!

Rob

1 Answer

I think a few things may be going on here. The first issue is that you cannot parse regular HTML with JDOM, because HTML is not XML.
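
To illustrate the HTML-versus-XML point, here is a minimal sketch. It uses the JDK's built-in `DocumentBuilder` rather than JDOM so that it runs without extra jars; that swap is my assumption, as are the class name `WellFormedCheck` and the sample markup. An XML parser of any kind should reject the first string for the same reason.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JDOM: checks whether markup is well-formed XML.
class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAXException subclasses.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Typical HTML: <br> and <p> are never closed, which XML does not allow.
        String html = "<html><body><p>Hello<br></body></html>";
        // The same content written as well-formed markup.
        String xhtml = "<html><body><p>Hello<br/></p></body></html>";
        System.out.println("html well-formed:  " + isWellFormed(html));
        System.out.println("xhtml well-formed: " + isWellFormed(xhtml));
    }
}
```

For real-world HTML you would normally put a lenient HTML parser or a tidy-up step in front of the XML parser instead of feeding it the raw page.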

Secondly, when I run kijiji.com through JDOM, I get an immediate HTTP 400 response.
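
If you want to test the browser-spoofing idea from the question, or just make sure a slow server cannot hang you forever, one option is to open the connection yourself. This is only a sketch: the class name `BrowserLikeRequest`, the User-Agent string, and the 10-second timeouts are placeholders I made up, and I have not checked whether any of this changes what kijiji returns.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: prepares (but does not yet open) an HTTP connection.
class BrowserLikeRequest {

    // Assumed value for illustration; substitute whatever browser string you want to mimic.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    static HttpURLConnection prepare(String address) throws IOException {
        // openConnection() only creates the object; no network traffic happens yet.
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        // Without timeouts, a stalled server can block the caller indefinitely.
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);    // milliseconds
        return conn;
    }
}
```

After `prepare(...)`, calling `conn.getResponseCode()` shows the status the server actually sends, and `conn.getInputStream()` can be passed to `parser.build(InputStream)` in place of the URL string.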

When I parse google.com, I get an immediate XML error about well-formedness.

If you happen to be parsing XHTML at some point, though, you will likely run into this problem: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/

XHTML has a doctype that references other doctypes, and so on. These each take 30 seconds to load from w3.org.
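
One common way around the DTD downloads is an `EntityResolver` that answers every external request with an empty stream, so the parser never goes to the network. The sketch below again uses the JDK's `DocumentBuilder` so it is self-contained; the class name `SkipRemoteDtd` is made up. JDOM's `SAXBuilder` has a `setEntityResolver(...)` method that should accept the same kind of resolver, though I have not run this against JDOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parses XHTML without downloading the DTD its doctype names.
class SkipRemoteDtd {

    static Document parseWithoutDtdFetch(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Hand back an empty document for every external entity (DTDs included).
        // Trade-off: named entities defined in the DTD, such as &nbsp;, are then unknown.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
              + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
              + "<head><title>t</title></head><body/></html>";
        Document doc = parseWithoutDtdFetch(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

If your pages rely on entities from the DTD, a resolver that serves local copies of the DTD files is the safer variant of the same idea.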

rolfl