1

When I print a downloaded file using Jsoup, some information from the DocType is missing if the DocType contains a line break. Is this intended, or is it a bug?

For example:

The DocType looks like this:

 <!DOCTYPE html
      PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

If I print it using doc.html() or doc.toString(), I get:

 <!DOCTYPE html>

If there is no line break in it, I get the complete DocType, including the public and system identifiers.

What can I do to solve that?

Cheers Bene

Bene

2 Answers

2

Yes, that's a bug. Thanks for pointing it out. The tokeniser wasn't correctly ignoring whitespace between the doctype name and the public identifier.

I've fixed the bug and it will be available in jsoup 1.6.2.

Jonathan Hedley
0

My problem can be solved by bypassing the parser and keeping the raw response body alongside the parsed document:

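// Needs org.jsoup.Jsoup and org.jsoup.Connection; url, USER_AGENT and MAX_TIMEOUT are defined elsewhere.
org.jsoup.Connection con = Jsoup.connect(url).userAgent(USER_AGENT).timeout(MAX_TIMEOUT).followRedirects(true);
// Set the HTTP method before execute(); execute() is what actually sends the request.
Connection.Response resp = con.method(Connection.Method.GET).execute();
completeFile = resp.body(); // raw response body, exactly as downloaded
doc = resp.parse();         // the same response parsed into a Document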

Now you have the unfiltered, unparsed source in the variable "completeFile" and the parsed version in the Document "doc".
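
If only the DocType is needed from the raw source, something along these lines might work. This is a minimal sketch using only the Java standard library, not part of Jsoup: the class name DoctypeExtractor and the method extractDoctype are made up for illustration, and the regular expression assumes a DocType without an internal subset (no "[ ... ]" block containing ">").

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Jsoup API): pulls the DOCTYPE declaration
// out of raw HTML such as the "completeFile" string above.
final class DoctypeExtractor {

    // "[^>]*" also matches line breaks, so a DOCTYPE spread over
    // several lines is captured as a whole.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the first DOCTYPE declaration with every run of whitespace
    // collapsed to a single space, or an empty Optional if there is none.
    static Optional<String> extractDoctype(String rawHtml) {
        Matcher m = DOCTYPE.matcher(rawHtml);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group().replaceAll("\\s+", " "));
    }

    public static void main(String[] args) {
        String raw = "<!DOCTYPE html\n"
                + "      PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
                + "      \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
                + "<html><head></head><body></body></html>";
        System.out.println(extractDoctype(raw).orElse("(no doctype found)"));
    }
}
```

Calling DoctypeExtractor.extractDoctype(completeFile) would then give the full declaration on one line, regardless of how the parser printed it.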

I hope this helps somebody.

Cheers Bene

Bene