Jsoup eats extra information of DocType if it includes a linebreak

Question

When I print a downloaded document with Jsoup, part of the DocType is missing if the declaration contains a linebreak. Is this intended, or is it a bug?

For example:

The DocType looks like this:

 <!DOCTYPE html
      PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

And if I print it using doc.html() or doc.toString() I get:

 <!DOCTYPE html>

If there is no linebreak in it, I get the complete DocType, including all information.

What can I do to solve that?

Cheers Bene

score 2 · Answer 1 · answered Aug 28 '11 at 06:42


Yes, that's a bug. Thanks for pointing it out. The tokeniser wasn't correctly ignoring whitespace between the doctype name and the public identifier.

I've fixed the bug and it will be available in jsoup 1.6.2.

Jonathan Hedley

score 0 · Answer 2 · answered Jul 31 '11 at 11:13

My problem can be solved by bypassing the parser and reading the raw response body:

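org.jsoup.Connection con = Jsoup.connect(url).userAgent(USER_AGENT).timeout(MAX_TIMEOUT).followRedirects(true);
// Set the method before execute(); setting it on the Response afterwards does not affect the request.
Response resp = con.method(Method.GET).execute();
completeFile = resp.body(); // raw, unparsed response text
doc = resp.parse();         // parsed Document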

Now you have the unfiltered, unparsed source in the variable "completeFile" and the nicely parsed version in the Document "doc".
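
If only the full DocType is needed, it could be pulled straight out of the raw text held in "completeFile". The following is a minimal sketch using nothing but java.util.regex. The class DoctypeExtractor and its extract method are hypothetical helpers, not part of jsoup. It assumes the declaration contains no ">" inside a quoted identifier and no internal subset in square brackets.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: finds the DOCTYPE declaration in raw HTML text.
final class DoctypeExtractor {

    // "<!DOCTYPE" in any case, then everything up to the first ">".
    // [^>] also matches line breaks, so a multi-line declaration is captured whole.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);

    private DoctypeExtractor() {}

    /** Returns the first DOCTYPE declaration, with whitespace runs collapsed to single spaces. */
    static Optional<String> extract(String rawHtml) {
        Matcher m = DOCTYPE.matcher(rawHtml);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group().replaceAll("\\s+", " "));
    }

    public static void main(String[] args) {
        // Same declaration as in the question, including the line breaks.
        String raw = "<!DOCTYPE html\n"
                + "      PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
                + "      \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
                + "<html><head><title>t</title></head><body></body></html>";
        System.out.println(extract(raw).orElse("(no doctype found)"));
    }
}
```

Calling DoctypeExtractor.extract(completeFile) would then give the declaration on one line, independent of how the parser treats it.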

I hope this helps somebody.

Cheers Bene
