
How can I unpack non-standard HTML like this:

<body>
    <div class="open">
        <div style='style'>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>         
        <div style='style'>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>
    </div>
</body>

I want to get a result similar to:

['Raw name 1', Text_1, Text_2, Text_3, Text_4, Text_5]
...
['Raw name 5', Text_1, Text_2, Text_3, Text_4, Text_5]

I tried to adapt the example from "How to parse a HTML table with Nokogiri?", but nothing happened.

Is it possible to obtain the information I want from HTML like this?

the Tin Man
Denis T.
  • Welcome to Stack Overflow. You've given us some data, and a desired output, but you didn't show us what you tried. What does "unpack the non-standard HTML" and "nothing happened" mean? I don't see the normal `<html>` tags but don't have a way of telling whether that's missing in your example or in the original source. (It doesn't matter to Nokogiri.) Please read "[mcve]". We need the minimum code that demonstrates the problem. Without that it looks like you didn't try and want us to write it for you. – the Tin Man Jun 07 '17 at 23:06
  • You can't use table parsing for your example. Tables are nested into rows and columns, making it very logical how to iterate over it. Your data isn't nested, except inside the first `<div>`. It's a list, and possibly _appears_ as a list of lists once CSS is applied, but the visual layout has nothing to do with the way we have to retrieve the data. – the Tin Man Jun 07 '17 at 23:27
  • What do you mean by non-standard? Regardless, whether it is standard or not does not matter as long as it is valid HTML. – sawa Jun 08 '17 at 01:18

2 Answers


If I understand correctly, this might work for you:

require 'nokogiri'
body = <<-BODY 
<body>
    <div class="open">
        <div style='style'>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>         
        <div style='style'>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>
    </div>
</body>   
BODY

doc = Nokogiri::HTML(body)
doc.xpath('//body/div').children.each_with_object({}) do |node,obj|
    text = node.text.strip
    obj[text] = [] if node.name == 'div'
    obj[obj.keys.last] << text if node.name == 'p'
end
#=> {"Raw name 1"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"], 
#     "Raw name 5"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]}

Steps:

  • This follows the XPath to every div that is a direct child of body, which here is just the outer div (doc.xpath('//body/div')).
  • Then it passes each child of that div (.children) to the block along with a Hash that serves as the accumulator (.each_with_object({}) do |node,obj|). Whitespace text nodes are passed too, but they match neither condition and are skipped.
  • It adds a key for each div tag and assigns an empty Array to it (obj[text] = [] if node.name == 'div').
  • It appends the text of each following p tag to the most recently added key (obj[obj.keys.last] << text if node.name == 'p').

The result is a Hash where each key is the text of a div and each value is an Array holding the text of the p tags that follow it, up to the next div.
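
If the array-of-arrays shape from the question is needed rather than a Hash, the result can be flattened afterwards. A minimal sketch, assuming the Hash shown above has been assigned to a variable (`grouped` and `rows` are names chosen here for illustration, not part of the code above):

```ruby
# Hypothetical variable holding the Hash produced by the each_with_object call above.
grouped = {
  "Raw name 1" => ["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"],
  "Raw name 5" => ["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]
}

# Prepend each key to its array of values to get one flat row per heading.
rows = grouped.map { |name, texts| [name, *texts] }

rows.each { |row| p row }
# ["Raw name 1", "Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]
# ["Raw name 5", "Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]
```

This relies on Ruby Hashes preserving insertion order, so the rows come out in document order.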

engineersmnky

I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<body>
    <div class="open">
        <div style='style'>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>         
        <div style='style'>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
    </div>
</body>
EOT

doc.at('.open').elements.slice_before { |e| e.name == 'div' }.map { |ary|
  ary.map(&:text)
}
# => [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]

Breaking it down a bit:

doc.at('.open').elements.map(&:name) # => ["div", "p", "p", "div", "p", "p"]
doc.at('.open').elements.slice_before { |e| e.name == 'div' }.map { |a| a.map(&:name) } # => [["div", "p", "p"], ["div", "p", "p"]]

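elements and slice_before are the magic here: elements returns only the element children, skipping the whitespace text nodes, and slice_before starts a new chunk at each element for which the block returns true.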
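
To see the grouping step in isolation, here is a sketch that runs without Nokogiri. The `Node` Struct is a stand-in invented for this example; it only mimics the `name` and `text` methods used above and is not a Nokogiri class.

```ruby
# Hypothetical stand-in for Nokogiri elements: responds to #name and #text only.
Node = Struct.new(:name, :text)

nodes = [
  Node.new('div', 'Raw name 1'),
  Node.new('p',   'Text_1'),
  Node.new('p',   'Text_2'),
  Node.new('div', 'Raw name 5'),
  Node.new('p',   'Text_1'),
  Node.new('p',   'Text_2')
]

# slice_before starts a new chunk at every element for which the block is true,
# so each 'div' opens a group that collects the 'p' nodes following it.
groups = nodes.slice_before { |n| n.name == 'div' }.map { |chunk| chunk.map(&:text) }

p groups
# [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]
```

Any leading elements that come before the first 'div' would end up in a chunk of their own, which is worth checking for if the real markup is less regular than the sample.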

the Tin Man