0

I need to pull a fragment out of a large XML file and work only with that fragment.

require 'nokogiri'

xml = <<XMLEND
<CFRDOC xsi:noNamespaceSchemaLocation="CFRMergedXML.xsd">
    <TITLE>
        <SUBTITLE>
            <CHAPTER>
                <TOC></TOC>
                <PART></PART>
                <PART></PART>
                <PART>
                    <EAR>Pt. 1903</EAR>
                    <HD SOURCE="HED">PART 1903—INSPECTIONS, CITATIONS AND PROPOSED PENALTIES</HD>
                    <CONTENTS></CONTENTS>
                    <AUTH></AUTH>
                    <SOURCE></SOURCE>
                    <SECTION>section1</SECTION>
                    <SECTION>section2</SECTION>
                    <SECTION>section3</SECTION>
                    <SECTION>section4</SECTION>
                </PART>
            </CHAPTER>
        </SUBTITLE>
    </TITLE>
</CFRDOC>
XMLEND

doc = Nokogiri::HTML(xml)

section = doc.xpath("//section")

# I can grab a specific node...
section[3].text          
=> "section4"

# copy it 
temp = section[3].dup
=> #<Nokogiri::XML::Element:0x261ce64 name="section" children=[#<Nokogiri::XML::Text:0x261c98c "section4">]>

# but the variable still refers to the whole...
doc.xpath("//part").size
=> 3
section.xpath("//part").size
=> 3
temp.xpath("//part").size 
=> 3

Coming from a PHP background, I'm having to rethink variables a bit. I know variables are different in Ruby; they are pointers to an object.

Therefore, when I run temp.xpath, I'm actually running it on doc. But I'm wanting to grab a specific node and its children, and then work on it as a new object. This would narrow down the haystack immensely and make the rest of my job so much easier!

How do I create a new object using only the node I have selected? I want to turn section[3] into a new object that wouldn't see the other <part> elements and their associated <section> tags.

the Tin Man
Tim Morton
  • Did you find the CFRMergedXML.xsd to validate the cfrdoc? Did you have to combine the xsd with an xjb? – ABC123 Oct 16 '16 at 15:28

2 Answers

2

"//part" means "start at the top of the document and search to the bottom, finding all <part> nodes."

That's not what you want.

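Instead you want a path that is relative to the current node:

".//part"

which means "start at the current node and search everything inside it." To match only direct children of the current node, use "./part".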

It's easiest to think of XPath as if you're navigating a directory structure on a disk. If you wanted to find a file at the root of the drive you'd use:

/foo

and if you wanted to find a file in the current directory you'd use:

./foo

XPath uses // to say "search from the top to the bottom":

//foo

I recommend using CSS selectors instead of XPath unless you need the power of XPath. I find XPath to be visually noisy. So, instead, I'd use:

section = doc.search('section')

and

section.search('part')

Now, meditate on this:

require 'nokogiri'

xml = <<XMLEND
<CFRDOC xsi:noNamespaceSchemaLocation="CFRMergedXML.xsd">
  <TITLE>
    <SUBTITLE>
      <CHAPTER>
        <PART></PART>
        <PART>
          <SECTION>section1</SECTION>
          <SECTION>section2</SECTION>
          <SECTION>section3</SECTION>
          <SECTION>section4</SECTION>
        </PART>
      </CHAPTER>
    </SUBTITLE>
  </TITLE>
</CFRDOC>
XMLEND

doc = Nokogiri::XML(xml)

I reduced the XML for readability.

doc.search('SECTION').map(&:text) # => ["section1", "section2", "section3", "section4"]
doc.search('PART').size # => 2
doc.search('PART[2]').text # => "\n          section1\n          section2\n          section3\n          section4\n        "
doc.search('PART[2]').search('SECTION').map(&:text) # => ["section1", "section2", "section3", "section4"]
doc.search('PART[2] SECTION').map(&:text) # => ["section1", "section2", "section3", "section4"]
doc.search('PART SECTION').map(&:text) # => ["section1", "section2", "section3", "section4"]

Using simple selectors it's easy to drill into a document. Sometimes it's impossible to write a simple selector, so we have to find way-points in the document and navigate from those, but based on the example XML it's pretty straightforward.

See "How to avoid joining all text from Nodes when scraping" also.

Community
the Tin Man
  • Thanks for the quick response. Yes, `//` is a "global" search, but frequently gov't XML fails to encapsulate things in a logical way. For my own sanity, I'm wanting to extract a smaller chunk to work with. Put another way, I'm wanting to do a `//` search on a smaller subset, because sometimes the content I'm looking for is wrapped in another node, sometimes not-- so I gravitate towards using `//` – Tim Morton Nov 11 '15 at 20:27
  • Ah, you expanded on your answer while I was commenting. That does look clean, and it looks a lot simpler to refer to specific nodes. I'll have to consider that. – Tim Morton Nov 11 '15 at 20:52
  • You'll find that complete answers take a while to generate and often go through several updates before they're stable. (Many of us wait a while to see if we like the direction the answers are going before answering too; I don't need points, I just want to see sensible answers.) So, it's a good idea to wait a day before beginning to analyze and select an answer. – the Tin Man Nov 11 '15 at 22:13
  • Clean code is essential, both when writing the code, but even more so when returning to the code months or years later. I harp on maintainability and code we can grok quickly is much better at 3:00AM when the world is in flames. – the Tin Man Nov 11 '15 at 22:15
  • Noted and appreciated :) Hard to wait when you're stumped :/ I have a follow-up question on `doc.search()`: When I was testing it, it only seems to work with `Nokogiri::HTML()` and not `Nokogiri::XML()`? Not sure what's going on with that. – Tim Morton Nov 11 '15 at 23:39
  • Would it be good form to unselect the "answered" check in order to invite more conversation (and possible refining of answers), or would it just be insulting at this point? Last thing I want to do is insult folks that are helping me understand. – Tim Morton Nov 11 '15 at 23:43
  • I don't think you can unselect it, but I've never tried, and I've asked very few questions so I'm not a good resource there. :-) The consensus in the community is to wait a while before selecting answers for the reasons I said. You can "reselect" but that's not important to me, I'd rather you find the answer that works correctly for you, the answer that is efficient and easy to understand, AKA "elegant". – the Tin Man Nov 11 '15 at 23:57
1

Use to_xml to turn temp back into an XML string, then use Nokogiri::XML again to get a new object.

my_section = Nokogiri::XML(temp.to_xml)
my_section.xpath('//part').size
# => 0

puts my_section
# <?xml version="1.0"?>
# <section>section4</section>

(I'm not sure why you're using Nokogiri::HTML to begin with, but you may substitute that back in here for XML if you think you need to.)

user513951
  • Now that I see it, it's obvious... There was some reason I was using `Nokogiri::HTML` instead of `Nokogiri::XML`. But for the life of me I don't remember what it was or if it was a _good_ reason. – Tim Morton Nov 11 '15 at 21:01
  • Parsing using `Nokogiri::XML` vs. the HTML variant results in tighter restrictions being applied for XML. See http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/ParseOptions, the `DEFAULT_XML` definition. HTML is "freakin' sloppy" (my words for it) but that's what libXML needs to understand HTML. Using `to_s` works, but it'd be more proper to use `to_xml`, `to_xhtml` or `to_html` than `to_s`. – the Tin Man Nov 12 '15 at 00:11