Summary

I'm using Ruby (ruby 2.1.2p95 (2014-05-08) [x86_64-linux-gnu] on my machine, ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux] in production environment) and Nori to convert an XML document (initially processed with Nokogiri for some validation) into a Ruby Hash, but I later discovered that Nori is dropping the attributes of the deepest XML elements.

Issue Details and Reproducing

To do this, I'm using code similar to the following:

xml  = Nokogiri::XML(File.open('file.xml')) { |config| config.strict.noblanks }
hash = Nori.new.parse xml.to_s

The code generally works as intended, except for one case. Whenever Nori parses the XML text, it drops element attributes from the leaf elements (i.e. elements that have no child elements).

For example, the following document:

<?xml version="1.0"?>
<root>
  <objects>
    <object>
      <fields>
        <id>1</id>
        <name>The name</name>
        <description>A description</description>
      </fields>
    </object>
  </objects>
</root>

...is converted to the expected Hash (some output omitted for brevity):

irb(main):066:0> xml = Nokogiri::XML(txt) { |config| config.strict.noblanks }
irb(main):071:0> ap Nori.new.parse(xml.to_s), :indent => -2
{
  "root" => {
    "objects" => {
      "object" => {
        "fields" => {
          "id"   => "1",
          "name" => "The name",
          "description" => "A description"
        }
      }
    }
  }
}

The problem shows up when element attributes are used on elements with no children. For example, the following document is not converted as expected:

<?xml version="1.0"?>
<root>
  <objects>
    <object id="1">
      <fields>
        <field name="Name">The name</field>
        <field name="Description">A description</field>
      </fields>
    </object>
  </objects>
</root>

The same Nori.new.parse(xml.to_s), as displayed by awesome_print, shows the attributes of the deepest <field> elements are absent:

irb(main):131:0> ap Nori.new.parse(xml.to_s), :indent => -2
{
  "root" => {
    "objects" => {
      "object" => {
        "fields" => {
          "field" => [
            [0] "The name",
            [1] "A description"
          ]
        },
        "@id"    => "1"
      }
    }
  }
}

The Hash only has their values as a list, which is not what I wanted. I expected the <field> elements to retain their attributes just like their parent elements (e.g. see @id="1" for <object>), not for their attributes to get chopped off.

Even if the document is modified to look as follows, it still doesn't work as expected:

<?xml version="1.0"?>
<root>
  <objects>
    <object id="1">
      <fields>
        <Name type="string">The name</Name>
        <Description type="string">A description</Description>
      </fields>
    </object>
  </objects>
</root>

It produces the following Hash:

{
  "root" => {
    "objects" => {
      "object" => {
        "fields" => {
          "Name"        => "The name",
          "Description" => "A description"
        },
        "@id"    => "1"
      }
    }
  }
}

This Hash lacks the type="whatever" attributes for each field entry.

Searching eventually led me to Issue #59, where the last post (from Aug 2015) states that its author can't "find the bug in Nori's code."

Conclusion

So, my question is: Are any of you aware of a way to work around the Nori issue (e.g. perhaps a setting) that would allow me to use my original schema (i.e. the one with attributes in elements with no children)? If so, can you share a code snippet that will handle this correctly?

I had to redesign my XML schema and change the code about three times to make it work, so if there's a way to get Nori to behave, and I'm simply not aware of it, I'd like to know what it is.

I'd like to avoid installing more libraries as much as possible just to get this working properly with the schema structure I originally wanted to use, but I'm open to the possibility if it's proven to work. (I'd have to re-factor the code once again...) Frameworks are definitely overkill for this, so please: do not suggest Ruby on Rails or similar full-stack solutions.

Please note that my current solution, based on a (reluctantly) redesigned schema, is working, but it's more complicated to generate and process than the original one, and I'd like to go back to the simpler/shallower schema.

Peter Mortensen
code_dredd
  • I would suggest to create your own recursive method "xml to json". You can do it with nokogiri. – andoke Jul 07 '16 at 07:36
  • @andoke: I'd appreciate if you could elaborate on that, maybe with an answer that includes proof-of-concept code. If I'm going to spend any more time on this at work, and refactor the XML document, I need to know that it will actually work and not be a dead-end. – code_dredd Jul 07 '16 at 07:37
  • You can do something like this : http://stackoverflow.com/questions/6478005/how-to-convert-nokogiri-document-object-into-json – andoke Jul 07 '16 at 08:43
  • There is a GitHub Issue for this bug in Nori: [issue #59 “It ignores attributes when a child is a text node”](https://github.com/savonrb/nori/issues/59) – Rory O'Kane Aug 01 '16 at 05:27
  • @RoryO'Kane: Thanks, but I had already come across that before posting the question, and I even mentioned it in the original post. For this post, I was trying to see if anyone knew of a workaround to the issue. – code_dredd Aug 01 '16 at 05:28
  • Sorry, I didn’t notice the existing link in your post. I misread the date on that issue as being in this year rather than in 2014, so I thought that issue had been posted after you wrote this question. – Rory O'Kane Aug 01 '16 at 05:33
  • I simplified my example. I hope it will be clear to you now. The example does not require the `Nori::StringWithAttributes` class be extended, but if you want to use `#inspect` or `to_json` on it and have the attributes be included then you will need to extend it. – G. Allen Morris III Aug 28 '16 at 15:01

1 Answer

Nori is not actually dropping the attributes; they are just not being printed.

If you run this Ruby script:

require 'nori'

data = Nori.new(empty_tag_value: true).parse(<<XML)
<?xml version="1.0"?>
<root>
  <objects>
    <object>
      <fields>
        <field name="Name">The name</field>
        <field name="Description">A description</field>
      </fields>
    </object>
  </objects>
</root>
XML

field_list = data['root']['objects']['object']['fields']['field']

# inspect shows only the text of each element, not its attributes
puts field_list.inspect

puts "text: '#{field_list[0]}' data: #{field_list[0].attributes}"
puts "text: '#{field_list[1]}' data: #{field_list[1].attributes}"

You should get the output:

["The name", "A description"]
text: 'The name' data: {"name"=>"Name"}
text: 'A description' data: {"name"=>"Description"}

This clearly shows that the attributes are there, but that they are not displayed by the inspect method (p(x) is the same as puts x.inspect).

You will notice that puts field_list.inspect outputs ["The name", "A description"], but field_list[0].attributes returns the attribute keys and values.
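The behaviour described above can be reproduced without any gems. This is only a sketch: TaggedString is a hypothetical stand-in for Nori::StringWithAttributes, assumed to be a String subclass with an extra attributes accessor. It shows that String#inspect prints only the characters, while the extra state stays on the object.

```ruby
# Hypothetical stand-in for Nori::StringWithAttributes (assumption:
# a String subclass that carries an extra `attributes` hash).
class TaggedString < String
  attr_accessor :attributes

  def initialize(text, attributes = {})
    super(text)
    @attributes = attributes
  end
end

field = TaggedString.new("The name", { "name" => "Name" })

# String#inspect knows nothing about the extra instance variable,
# so printing the value makes the attributes look lost.
puts field.inspect

# The attributes are still reachable through the accessor.
p field.attributes
```

Any printer that relies on inspect (p, pp, and by default ap) will therefore show only the text unless it is taught about the subclass.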

If you would like pp to display the attributes, you can override the inspect method in Nori::StringWithAttributes:

class Nori
  class StringWithAttributes < String
    def inspect
      [attributes, String.new(self)].inspect
    end
  end
end

Or, if you want to change the output, you can override the self.new method so that it returns a different data structure:

class Nori
  class MyText < Array
    def attributes=(data)
      self[1] = data
    end
    attr_accessor :text
    def initialize(text)
      self[0] = text
      self[1] = {}
    end
  end
  class StringWithAttributes < String
    def self.new(x)
      MyText.new(x)
    end
  end
end

And access the data as a tuple:

puts "text: '#{data['root']['objects']['object']['fields']['field'][0].first}' data: #{ data['root']['objects']['object']['fields']['field'][0].last}"

This lets you serialize the data as JSON or YAML, because each text item becomes an array with 2 elements (the text and its attributes). pp also works:

{"root"=>
  {"objects"=>
    {"object"=>
      {"fields"=>
        {"field"=>
          [["The name", {"name"=>"Name"}],
           ["A description", {"name"=>"Description"}]]},
       "bob"=>[{"@id"=>"id1"}, {"@id"=>"id2"}],
       "bill"=>
        [{"p"=>["one", {}], "@id"=>"bid1"}, {"p"=>["two", {}], "@id"=>"bid2"}],
       "@id"=>"1"}}}}
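If you would rather not patch Nori's classes, another option is to post-process the parsed Hash. The sketch below is an assumption-laden illustration, not part of Nori: expand_attributes is a hypothetical helper, the "#text" key is an arbitrary choice, and TaggedString again stands in for Nori::StringWithAttributes so the example runs without the gem. It relies only on leaf values being strings that respond to attributes.

```ruby
require 'json'

# Hypothetical helper: walk a parsed structure and turn every string
# that carries non-empty attributes into a plain Hash, so the
# attributes survive inspect, to_json and to_yaml.
def expand_attributes(node)
  case node
  when Hash
    node.each_with_object({}) { |(key, value), out| out[key] = expand_attributes(value) }
  when Array
    node.map { |item| expand_attributes(item) }
  when String
    attrs = node.respond_to?(:attributes) ? node.attributes : nil
    return node if attrs.nil? || attrs.empty?

    # "#text" is an illustrative key; "@" mirrors the prefix seen on "@id".
    attrs.each_with_object({ "#text" => String.new(node) }) do |(name, value), out|
      out["@#{name}"] = value
    end
  else
    node
  end
end

# Stand-in for Nori::StringWithAttributes (assumption, see above).
class TaggedString < String
  attr_accessor :attributes
end

name = TaggedString.new("The name")
name.attributes = { "name" => "Name" }

parsed = { "fields" => { "field" => [name, "plain text"] } }
plain  = expand_attributes(parsed)

puts JSON.generate(plain)
```

With real Nori output you would call expand_attributes(data) on the result of Nori.new.parse; strings without attributes are returned unchanged.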

This should do what you want.

require 'awesome_print'
require 'nori'

# Copyright (c) 2016 G. Allen Morris III
#
# Awesome Print is freely distributable under the terms of MIT license.
# See LICENSE file or http://www.opensource.org/licenses/mit-license.php
#------------------------------------------------------------------------------
module AwesomePrint
  module Nori

    def self.included(base)
      base.send :alias_method, :cast_without_nori, :cast
      base.send :alias_method, :cast, :cast_with_nori
    end

    # Add Nori XML Node and NodeSet names to the dispatcher pipeline.
    #-------------------------------------------------------------------
    def cast_with_nori(object, type)
      cast = cast_without_nori(object, type)
      if defined?(::Nori::StringWithAttributes) && object.is_a?(::Nori::StringWithAttributes)
        cast = :nori_xml_node
      end
      cast
    end

    #-------------------------------------------------------------------
    def awesome_nori_xml_node(object)
      return %Q|["#{object}", #{object.attributes}]|
    end
  end
end

AwesomePrint::Formatter.send(:include, AwesomePrint::Nori)

data = Nori.new(empty_tag_value: true).parse(<<XML)
<?xml version="1.0"?>
<root>
  <objects>
    <object>
      <fields>
        <field name="Name">The name</field>
        <field name="Description">A description</field>
      </fields>
    </object>
  </objects>
</root>
XML

ap data

The output is:

{
    "root" => {
        "objects" => {
            "object" => {
                "fields" => {
                    "field" => [
                        [0] ["The name", {"name"=>"Name"}],
                        [1] ["A description", {"name"=>"Description"}]
                    ]
                }
            }
        }
    }
}
G. Allen Morris III
  • The 2-line sample at the top, where you suggest that the leaf attributes are just not being printed, actually crashes with `TypeError: no implicit conversion of String into Integer`. Also, if they're just not being printed, why is there a need to go around overloading other classes/methods, which you seem to imply is unnecessary? – code_dredd Aug 28 '16 at 14:11
  • Your updated post worked, though `ap` still fails to print the attributes. Since the attributes are there, is there a way to get them to print automatically (e.g. `ap data, :indent => -2`)? (Note I'm using the `awesome_print` gem.) Also, the bug report I linked in my OP suggests there's an actual issue here. Was this fixed and the issue report just not updated? – code_dredd Sep 01 '16 at 10:18
  • Gave you the win. Things changed and I no longer need to keep pursuing the issue, but given that you appear to have tested the code (e.g. provided output) and seems like a close-enough work-around, I'm giving you the points. – code_dredd Sep 05 '16 at 20:15