How to get all nodes from an HTML document in Ruby with Nokogiri

Question

I'm trying to get all the nodes from an HTML document using Nokogiri.

I have this HTML:

<html>
<body>
  <h1>Header1</h1>
  <h2>Header22</h2>
  <ul>
    <li>Li1</li>
    <ul>
       <li>Li1</li>
       <li>Li2</li>
    </ul>
  </ul>
</body>
</html>

String version:

string_page = "<html><body><h1>Header1</h1><h2>Header22</h2><ul><li>Li1</li><ul><li>Li1</li><li>Li2</li></ul></ul></body></html>"

I created an object:

page = Nokogiri.HTML(string_page)

And I was trying to traverse it:

result = []
page.traverse { |node| result << node.name unless node.name == "text" }
=> ["html", "h1", "h2", "li", "li", "li", "ul", "ul", "body", "html", "document"]

But I don't like the order of the elements. I need an array in the same order as the elements appear in the document:

["html", "body", "h1", "h2", "ul", "li", "ul", "li", "li" ]

I don't need closing tags.

Does anybody have a better solution to accomplish this?

Why are you doing that? It's horribly inefficient to walk through every node by iterating. You could do the same thing using a SAX parser and it'd probably run a lot faster. — the Tin Man, Dec 04 '14 at 16:17

score 7 · Accepted Answer · edited May 23 '17 at 12:25

If you want to see the nodes in document order, use a selector like '*', which matches every element, starting from the root node:

require 'nokogiri'
string_page = "<html><body><h1>Header1</h1></body></html>"
doc = Nokogiri::HTML(string_page)
doc.search('*').map(&:name)
# => ["html", "body", "h1"]

But we don't normally need to iterate over every node, nor do we usually want to. We want to find all nodes of a certain type, or individual nodes, so we look for landmarks in the markup and go from there:

doc.at('h1').text # => "Header1"

or:

html = "<html><body><table><tr><td>cell1</td></tr><tr><td>cell2</td></tr></table></body></html>"
doc = Nokogiri::HTML(html)
doc.search('table tr td').map(&:text) # => ["cell1", "cell2"]

or:

doc.search('tr td').map(&:text) # => ["cell1", "cell2"]

or:

doc.search('td').map(&:text) # => ["cell1", "cell2"]

Note: there's no reason to use a longer sample HTML string; it just clutters up the question, so use a minimal example.

See "How to avoid joining all text from Nodes when scraping" also.

Thank you, the Tin Man. Can't believe your solution is that simple! I know how to iterate over nodes, but I needed all of them and didn't know about `*`. I need to save all the nodes because I want to compare the structure of two different websites. I ended up using a longer sample HTML to make sure I had enough levels of nesting and to highlight the importance of order. — radubogdan, Dec 04 '14 at 21:13
