
I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse each page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

the Tin Man
Jarsen
  • What are you wanting to print: the HTML content (tags and all) or select items? There are different methods for each, and a clarification would really help with an answer. – user214028 Dec 14 '09 at 09:01

8 Answers


The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

  • Parse the document as XML
  • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
  • Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
# Parse as XML and drop whitespace-only text nodes so the serializer can re-indent.
doc = Nokogiri::XML(html, &:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml(indent: 3, indent_text: ".")
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>
Ian Bytchek
Phrogz
  • Seems like it doesn't split a chain of tags into several lines, but writes them one after another. The problem appears in a document that originally had one tag per line, after applying http://stackoverflow.com/questions/2696537: after that code the tags somehow get joined into one chain, making this to_xhtml a bit useless. – Nakilon Nov 16 '12 at 05:33
  • @Nakilon Did you parse the XML using the &:noblanks option? – Phrogz Nov 16 '12 at 05:47
  • yep, http://pastebin.com/raw.php?i=tKSSVjaG – remove `if false` to see, how `change_language` urls are joining. (smth wrong with my browser or SO, can't write ur username with @, it's just disappearing, lol) – Nakilon Nov 16 '12 at 06:01
  • @Nakilon Looks fine to me; what am I missing? (Since I'm the author of the item you're commenting on the @ is unnecessary for notification, so SO tries to be helpful by not letting you add it.) – Phrogz Nov 16 '12 at 13:45
  • This `
  • – Nakilon Nov 16 '12 at 20:33
  • @Nakilon You should ask this as a new question with a simple repro case (including simple XML). – Phrogz Nov 16 '12 at 21:44
  • Could you please point me to the link/source where you found `&:noblanks` in the official doc? – Arup Rakshit Feb 22 '14 at 18:07
  • @ArupRakshit That is a Ruby shortcut for `Nokogiri.XML(…){|config| config.noblanks }`. The `Nokogiri.XML()` method is documented as a shortcut for [`Nokogiri::XML::Document.parse`](http://nokogiri.org/Nokogiri/XML/Document.html#method-c-parse). The block passed to the method is a shorthand for passing [parse options](http://nokogiri.org/Nokogiri/XML/ParseOptions.html). – Phrogz Feb 22 '14 at 18:28
  • I am applying the same [here](http://stackoverflow.com/questions/21957190/how-to-wrap-nokogiri-nodeset-in-one-span), but it is not happening. – Arup Rakshit Feb 22 '14 at 18:28
  • Unfortunately this won't wrap long lines into multiple ones. – DavidGamba May 01 '14 at 15:33
  • @DavidG You can do that yourself rather easily, with matching indentation, even. `wrapped = result.gsub(/^([ \t]*)(.{70,}?)(.+)/, "\\1\\2\n\\1\\3")` – Phrogz May 01 '14 at 16:00
  • I tried a lot; the key is `&:noblanks` when parsing. Saved my day! – Tim Kretschmer Nov 04 '16 at 06:14
  • I get a huge torrent of nonsense full of cucumber pollution lasting about a minute of high-speed scrolling, when I type puts Nokogiri::XML('', &:noblanks) at the Pry prompt. Kernel.puts works better. I hate Ruby. – android.weasel Mar 24 '17 at 13:54
  • @android.weasel And what happens when you call `to_xhtml` on the document? And, you realize that you're parsing an empty string as a document? What do you hope to have happen? Post as your own question if you cannot figure it out, not as a complaint comment on another answer. – Phrogz Mar 24 '17 at 14:14
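
The line-wrapping idea raised in the comments can also be sketched without a regex. This is a rough illustration in plain Ruby, not something Nokogiri provides: `wrap_lines` is a hypothetical helper that breaks each over-long line at the last space before the limit and repeats the line's indentation on the continuation. It assumes that breaking at a space is harmless for your markup, which is not true for whitespace-sensitive content such as `<pre>` blocks.

```ruby
# Hypothetical helper: wrap any line longer than `width` at the last space
# at or before the limit, repeating the line's leading indentation on each
# continuation line. Lines with no usable space are left unbroken.
def wrap_lines(text, width: 70)
  text.each_line(chomp: true).flat_map { |line|
    indent = line[/\A[ \t]*/]
    pieces = []
    rest = line
    while rest.length > width
      cut = rest.rindex(' ', width)
      # Stop if the only spaces available are part of the indentation.
      break if cut.nil? || cut < indent.length
      pieces << rest[0...cut]
      rest = indent + rest[(cut + 1)..-1]
    end
    pieces << rest
  }.join("\n")
end

long_line = "  <p>" + (["word"] * 30).join(" ") + "</p>"
puts wrap_lines(long_line, width: 40)
```

Breaking at spaces rather than at a fixed column avoids splitting a tag name or attribute value in half.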