
This is my sample.xml:

<?xml version="1.0" encoding="utf-8"?>
<ShipmentRequest>
   <Message>
      <Header>
      <MemberId>MID-0000001</MemberId>    
      <MemberName>Bruce</MemberName>
      <DeliveryId>0000001</DeliveryId>
      <OrderNumber>ON-000000001</OrderNumber>
      <ShipToName>Alan</ShipToName>
      <ShipToZip>123-4567</ShipToZip>
      <ShipToStreet>West</ShipToStreet>
      <ShipToCity>Seatle</ShipToCity>
       <Payments>
        <PayType>Credit Card</PayType>
        <Amount>20</Amount>
      </Payments>
      <Payments>
        <PayType>Points</PayType>
        <Amount>22</Amount>
      </Payments>
      <PayType />
      </Header>
    <Line>
      <LineNumber>3.1</LineNumber>
      <ItemId>A-0000001</ItemId>
      <Description>Apple</Description>
      <Quantity>2</Quantity>
      <UnitCost>5</UnitCost>
    </Line>
    <Line>
      <LineNumber>4.1</LineNumber>
      <ItemId>P-0000001</ItemId>
      <Description>Peach</Description>
      <Quantity>4</Quantity>
      <UnitCost>6</UnitCost>
    </Line>
    <Line>
      <LineNumber>5.1</LineNumber>
      <ItemId>O-0000001</ItemId>
      <Description>Orange</Description>
      <Quantity>2</Quantity>
      <UnitCost>4</UnitCost>
    </Line>
  </Message>
</ShipmentRequest>

And my sample.rb:

#!/usr/bin/ruby -w

require 'nokogiri'

doc = Nokogiri::XML(File.open("sample.xml"))
doc.xpath("//ShipmentRequest").each do |node|
  puts node.text
end

And the results I get:

MID-0000001    
Bruce
0000001
ON-000000001
Alan
123-4567
West
Seatle

Credit Card
20


Points
22




3.1
A-0000001
Apple
2
5


4.1
P-0000001
Peach
4
6


5.1
O-0000001
Orange
2
4

I'd also like to print the tag names and skip tags/nodes with blank values:

MemberId: MID-0000001

MemberName: Bruce

DeliveryId: 0000001

OrderNumber: ON-000000001

ShipToName: Alan

ShipToZip: 123-4567

ShipToStreet: West

etc...
Phrogz
Askar
  • Can a `ShipmentRequest` node contain more than one `Message` node? How do you want nested nodes (i.e. `Line` and `Payments`) to look in the output? – toro2k May 31 '13 at 09:06
  • In my case, I know there will be only one Message node. I want each tag name and its value to be printed in the order I showed in my post. I just need to add the tag name to each line and skip/ignore empty tags. – Askar May 31 '13 at 09:09
  • Your output is exactly what you're asking for: `<ShipmentRequest>` contains multiple child nodes, many of which contain text nodes. What did you expect would happen when you take a high-level node and try to get all `text`? – the Tin Man May 31 '13 at 17:19
  • @the Tin Man, I was not able to print tag names and skip empty nodes. – Askar Jun 01 '13 at 03:09

2 Answers


You basically want all the leaf elements. You can capture all of them in a single XPath expression:

leaves = doc.xpath('//*[not(*)]')

leaves.each do |node|
  puts "#{node.name}: #{node.text}" unless node.text.empty?
end

Output:

MemberId: MID-0000001
MemberName: Bruce
DeliveryId: 0000001
OrderNumber: ON-000000001
ShipToName: Alan
ShipToZip: 123-4567
ShipToStreet: West
ShipToCity: Seatle
PayType: Credit Card
Amount: 20
PayType: Points
Amount: 22
LineNumber: 3.1
ItemId: A-0000001
Description: Apple
Quantity: 2
UnitCost: 5
LineNumber: 4.1
ItemId: P-0000001
Description: Peach
Quantity: 4
UnitCost: 6
LineNumber: 5.1
ItemId: O-0000001
Description: Orange
Quantity: 2
UnitCost: 4

Explanation of XPath

The XPath //*[not(*)] finds all the leaf elements. How does it do that? Let's break it down:

  • The leading // means search at any depth, starting from the document root.
  • The * means any element, so //* matches all elements in the document.
  • The part in [] is called a predicate; it filters the preceding expression. I read it as "such that". Paths inside the predicate are evaluated relative to each candidate element, so for example a[b] means all the a elements that have a b child.
  • The not() is a boolean negation. Inside the predicate, * selects the child elements of the candidate, so not(*) is true only when the element has no child elements.

Putting it all together, you have "all elements in the document such that they do not have any child elements" == leaf elements.

Another version

In the comments, @Phrogz made a nice addition, moving the logic checking whether the element is empty to the XPath expression by adding another predicate. This has two benefits:

  • It will have improved performance because it doesn't return all leaves and then check them. This might be noticeable in a large document or if there are lots of empty leaves.
  • It becomes a one-liner!

puts doc.xpath('//*[not(*)][text()]').map{ |n| "#{n.name}: #{n.text}" }

Meaning "Every element that has no child elements, but that does have at least one child text node."

Mark Thomas
  • @MarkThomas : Can you briefly describe what `xpath('//*[not(*)]')` does? I am trying to find some documentation that details the usage of `not` but am unable to find any. – Anand Shah Jun 01 '13 at 07:40
  • Note that you can put the test for _"has any text"_ into the XPath expression as well, for better performance: `puts doc.xpath('//*[not(*)][text()]').map{ |n| "#{n.name}: #{n.text}" }`, meaning _"Every element that has no child elements, but that does have at least one child text node."_ – Phrogz Jun 03 '13 at 03:21
require 'nokogiri'

doc = Nokogiri::XML(File.open("sample.xml"))

# Header fields: print each non-empty child, descending one level into <Payments>.
doc.xpath("//ShipmentRequest/Message/Header").each do |row|
  row.elements.each do |e|
    next if e.text.empty?
    if e.name == "Payments"
      # Print the children (PayType, Amount) rather than the wrapper itself.
      e.elements.each do |ie|
        puts "#{ie.name} : #{ie.text}"
      end
    else
      puts "#{e.name} : #{e.text}"
    end
  end
end

# Line items: print each non-empty child of every <Line>.
doc.xpath("//ShipmentRequest/Message/Line").each do |row|
  row.elements.each do |e|
    next if e.text.empty?
    puts "#{e.name} : #{e.text}"
  end
end

Output

MemberId : MID-0000001
MemberName : Bruce
DeliveryId : 0000001
OrderNumber : ON-000000001
ShipToName : Alan
ShipToZip : 123-4567
ShipToStreet : West
ShipToCity : Seatle
PayType : Credit Card
Amount : 20
PayType : Points
Amount : 22
LineNumber : 3.1
ItemId : A-0000001
Description : Apple
Quantity : 2
UnitCost : 5
LineNumber : 4.1
ItemId : P-0000001
Description : Peach
Quantity : 4
UnitCost : 6
LineNumber : 5.1
ItemId : O-0000001
Description : Orange
Quantity : 2
UnitCost : 4
Anand Shah
  • What's xmldoc? I thought you meant my "doc", but it doesn't seem so... Can you please paste the whole code? I also would not like to print the tag name Payments, as I will have its children PayType and Amount. – Askar May 31 '13 at 09:36
  • I thought so... :) Can you please post that code here http://stackoverflow.com/questions/16810539/accessing-xml-file-with-rexml – Askar May 31 '13 at 14:22