
I need to extract nodes from HTML (not the inner text, so that I can preserve the formatting for further manual investigation). I wrote the code below, but because of how traverse works, I get duplicates in the new HTML file.

This is the real HTML to parse: http://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm

Basically I need to extract Item 10 and the part between "Executive Officers of the Registrant" and the next Item. Item 10 is in all documents, but "Executive Officers of the Registrant" is not. I need the nodes rather than just the text because I want to preserve the tables, so that in my next step I can parse any tables in these sections.

Sample html:

html = "
<BODY>
<P>Dont need this </P>  
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"

I want to get:

html = "
<BODY>
<P>Start</P>
<P>Text To Extract 1 </P>
<P><Font><B>Text to Extract 2 </B></Font></P>
<DIV><TABLE>
<TR>
<TD>Text to Extract 3</TD>
<TD>Text to Extract 4</TD>
</TR>
</TABLE></DIV>
<P>End</P>
</BODY>
"

Extraction should start when the start_keyword appears and stop when the end_keyword appears.

There are multiple sections I need to extract from one HTML document. The keywords can appear in nodes with different names.

doc.at_css('body').traverse do |node|
    inMySection  = false

    if node.text.match(/#{start_keyword}/)
        inMySection = true
    elsif node.text.match(/#{end_keyword}/)
        inMySection = false
    end
    if inMySection
        #Extract the nodes
    end
end

I also tried to achieve this with XPath, without success, after referring to these posts:

XPath axis, get all following nodes until

XPath to find all following siblings up until the next sibling of a particular type

JXU
  • It would help if you could post a sample of the html you are extracting from. – Chris Salzberg Jan 11 '13 at 00:38
  • The keywords are plaintext and can exist in text nodes anywhere in the document? Do you want to extract the node that contains the start keyword? Parent containers? I agree with @shioyama that you should post a sample, and I think you should also show what you want to extract. – Mark Thomas Jan 11 '13 at 01:53
  • Without HTML to test against we're shooting in the dark, making up test cases. And, is there an error, or are we supposed to make one up? If there is an error, show us what is wrong. – the Tin Man Jan 11 '13 at 02:50
  • Can the keywords be in the middle of a paragraph? What happens when they cross hierarchies, e.g. `

    START content

    And then we END

    `? What content should be extracted?
    – Phrogz Jan 11 '13 at 03:11
  • Sorry for the late reply. I thought I had set up an email alert but didn't get any email. Added sample HTML. – JXU Jan 14 '13 at 07:23
  • @JXU did you see my answer? – toch Mar 27 '13 at 09:04

1 Answer

It's not a problem with Nokogiri but with your algorithm. You've put your flag inMySection inside the loop, which means it is reset to false at every step, so you lose track of whether it was previously set to true.
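
To see the effect of where the flag is declared without involving Nokogiri at all, here is a minimal plain-Ruby sketch. The `lines` array and the variable names are made up for illustration; only the flag logic mirrors the code in the question.

```ruby
# Hypothetical stand-in for the visited nodes: just an array of strings.
lines = ["skip", "Start", "keep 1", "keep 2", "End", "skip"]

# Flag declared INSIDE the block: it is reset to false on every iteration,
# so only the element that itself matches /Start/ is ever selected.
inside = lines.select do |line|
  in_section = false
  if line =~ /Start/
    in_section = true
  elsif line =~ /End/
    in_section = false
  end
  in_section
end

# Flag declared OUTSIDE the block: it keeps its value between iterations,
# so everything from "Start" up to (but not including) "End" is selected.
in_section = false
outside = lines.select do |line|
  if line =~ /Start/
    in_section = true
  elsif line =~ /End/
    in_section = false
  end
  in_section
end

p inside   # => ["Start"]
p outside  # => ["Start", "keep 1", "keep 2"]
```

Note that with this if/elsif ordering the element matching the end keyword is itself excluded.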

Based on your sample HTML input and output, the following snippet works:

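require 'nokogiri'

nodes = Nokogiri::HTML(html)
# The flag lives outside the block, so it keeps its value between nodes.
inMySection = false
# traverse yields children before their parent.
nodes.at_xpath('//body').traverse do |node|
  if node.text.match(/Start/)
    inMySection = true
  elsif node.text.match(/End/)
    inMySection = false
  end
  # Drop everything outside the section; the node matching End is dropped too.
  node.remove unless inMySection
end
print nodes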
toch