HtmlUnit for Java is great, but I haven't been able to figure out how to view the full source of a web site or return it as a string. Can anyone help me with this?

I know the following will read the site, but now I just want to get the source back as a string.

HtmlPage mySite = webClient.getPage("http://mysite.com");

Thanks!

Jake Sankey

3 Answers

From looking through the API, my thought would be:

mySite.getWebResponse().getContentAsString();
Jeremy
  • the `toString()` method will definitely not work, I am not sure about the second though. Sounds like it might work but I have never tried it. – Jesse Webb May 13 '11 at 20:00
  • mySite.getWebResponse().getContentAsString(); works! it returns all of the source as if you chose "view source" from the page context menu! Thanks! – Jake Sankey May 13 '11 at 20:14
  • That is what the `asXml()` method does on HtmlPage. This may be the "accepted" answer, but that is not the way HtmlUnit intended you to get that information. – Jesse Webb May 13 '11 at 20:55
  • `asXml()` and `page.getWebResponse().getContentAsString()` are not exactly the same, as I just noticed. The former would remove the ` ` and replace it with ``. There may be other differences too, like an altered source tree, so beware. – Stoffe Jun 27 '12 at 11:00
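
The difference noted in the last comment comes from `asXml()` serializing the parsed DOM, while `getContentAsString()` returns the response body as received. As a rough analogy only, the sketch below uses the JDK's own XML APIs rather than HtmlUnit, and the class and method names (`SerializationDemo`, `reserialize`) are made up for illustration. It parses a small well-formed page and writes it back out, which is enough to show that a parse-then-serialize round trip need not reproduce the original text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK XML APIs only, not HtmlUnit.
class SerializationDemo {

    // Parse the raw text into a DOM, then serialize that DOM back to a string.
    static String reserialize(String rawSource) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rawSource)));

        // Identity transformer; force the XML output method explicitly.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.METHOD, "xml");

        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<!DOCTYPE html><html><body><p>hi</p></body></html>";
        System.out.println("raw:          " + raw);
        System.out.println("reserialized: " + reserialize(raw));
    }
}
```

With the JDK's default serializer, the round trip drops the DOCTYPE and adds an XML declaration. If you need the exact text the server sent, the raw response is the safer choice; if you want the page as the parser understood it, the serialized DOM is the better fit.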
String pageSource = myPage.asXml();

That will get you the full source of the web page, serialized as XML from the page HtmlUnit has parsed.

String pageText = myPage.asText();

That will get you all of the visible text on the page, including line breaks and white space. It is the same as opening the page in your browser, pressing Ctrl+A then Ctrl+C, and pasting the result into a variable.

Jesse Webb
Have you tried mySite.asXml()? Or you can try mySite.getDocumentElement().toString().

Kal