
I have a project which takes large amounts of XML data and passes it to Nokogiri, eventually adding each element to a hash that is written out to a YAML file.

This works until the XML data set contains duplicate keys.

Example Data:

<document>
    <form xmlns="">
        <title>
            <main-title>Foo</main-title>
        </title>
        <homes>
            <home>
                <home-name>home 1</home-name>
                <home-price>10</home-price>
            </home>
            <home>
                <home-name>home 2</home-name>
                <home-price>20</home-price>
            </home>
        </homes>
    </form>
</document>

Within the homes element I can have multiple homes, however each home will always contain different content.

This data should eventually output a structure like this:

title:
  main-title: Foo
homes:
  home:
    home-name: home 1
    home-price: 10
  home:
    home-name: home 2
    home-price: 20

However, all I ever get is the last element inside homes:

title:
  main-title: Foo
homes:
  home:
    home-name: home 2
    home-price: 20

I believe this to be because, when adding each element to the hash, it will simply overwrite the key if it already exists, thus always giving me the last key.
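
A minimal sketch of that overwrite, using literal values in place of the parsed nodes (the variable name `homes` is only illustrative):

```ruby
# Plain Hash: assigning to an existing key replaces the previous value.
homes = {}
homes["home"] = { "home-name" => "home 1", "home-price" => "10" }
homes["home"] = { "home-name" => "home 2", "home-price" => "20" }

homes.size                 # => 1
homes["home"]["home-name"] # => "home 2"
```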

This is the code used to append elements to the hash:

def map_content(nodes, content_hash)
  nodes.map do |element|
    case element
    when Nokogiri::XML::Element
      child_content = map_content(element.children, {})
      content_hash[element.name] = child_content unless child_content.empty?
    when Nokogiri::XML::Text
      return element.content
    end
  end
  content_hash
end

I believe

content_hash[element.name] = child_content

is the culprit. However, this code creates similar YAML files that have these types of duplicate keys, and I'd like to preserve that functionality. I don't want to simply add a unique key to the data hash, as this would mean modifying many methods and updating how they pull data from the YAML file.

I read about compare_by_identity but am not sure how I would implement it.


I tried using compare_by_identity but it just results in an empty YAML file, so maybe it's generating the hash but it can't be written to the YAML file?

def map_content(nodes, content_hash)
  content_hash = content_hash.compare_by_identity

  nodes.map do |element|
    case element
    when Nokogiri::XML::Element
      child_content = map_content(element.children, {})
      content_hash[element.name] = child_content unless child_content.empty?
    when Nokogiri::XML::Text
      return element.content
    end
  end
  content_hash
end
Artjom B.
chinds
  • What about having one 'homes' key and putting everything beneath it in an array? The structure you propose will not work, since it will give you the same result (with only the last :home key) when loading it back into your application. – Severin May 09 '17 at 07:53
  • Hmm I am trying to not modify the current structure, however this may be the only way. – chinds May 09 '17 at 08:14
  • Your desired YAML output isn't possible. The YAML, when parsed, will result in a hash with duplicate keys, which you've already found out isn't possible. You have to use an array of hashes. – the Tin Man May 10 '17 at 22:17

2 Answers


compare_by_identity is easy in principle:

hash = {}.compare_by_identity
hash[String.new("home")] = { "home-name" => "home 1", "home-price" => "10" }
hash[String.new("home")] = { "home-name" => "home 2", "home-price" => "20" }
hash
# => {"home"=>{"home-name"=>"home 1", "home-price"=>"10"}, "home"=>{"home-name"=>"home 2", "home-price"=>"20"}} 

(I use String.new to force the literal strings in source code to be different objects. You would not need this, as Nokogiri would dynamically construct string objects, and they would have different object_id.)

I.e. literally all you need to do is call .compare_by_identity on each Hash you make. However, this is not without its price: access stops working.

hash["home"]
# => nil

You would need to explicitly check each element's equality now:

hash.to_a.select { |k, v| k == "home" }.map { |k, v| v }
# => [{"home-name"=>"home 1", "home-price"=>"10"}, {"home-name"=>"home 2", "home-price"=>"20"}]

As Severin notes, it will also have dire repercussions if you put it into YAML or JSON, as you will not be able to load it properly back again.

Another approach you could take, and a much preferred one, is to leave XML peculiarities to XML, and transform the structure into something more JSON-y (and thus representable directly by Hash and Array objects). For example,

class MultiValueHash < Hash
  def add(key, value)
    if !has_key?(key)
      self[key] = value
    elsif Array === self[key]
      self[key] << value
    else
      self[key] = [self[key], value]
    end
  end
end

hash = MultiValueHash.new
hash.add("home", { "home-name" => "home 1", "home-price" => "10" })
hash.add("home", { "home-name" => "home 2", "home-price" => "20" })
hash.add("work", { "work-name" => "work 1", "work-price" => "30" })
hash["home"]
# => [{"home-name"=>"home 1", "home-price"=>"10"}, {"home-name"=>"home 2", "home-price"=>"20"}]
hash["work"]
# => {"work-name"=>"work 1", "work-price"=>"30"}
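
One thing to watch if the result is headed for a YAML file: as far as I know, Psych tags instances of Hash subclasses with the class name (something like `!ruby/hash:MultiValueHash`), which the code reading the file may not expect. A minimal sketch of the same idea as a standalone helper on plain hashes; `add_value` is a hypothetical name, not part of any library:

```ruby
require 'yaml'

# Hypothetical helper: same logic as MultiValueHash#add, but on a plain Hash,
# so that to_yaml emits an ordinary mapping without a class tag.
def add_value(hash, key, value)
  if !hash.key?(key)
    hash[key] = value                 # first occurrence: store the value itself
  elsif hash[key].is_a?(Array)
    hash[key] << value                # third and later: append to the array
  else
    hash[key] = [hash[key], value]    # second occurrence: promote to an array
  end
  hash
end

homes = {}
add_value(homes, "home", { "home-name" => "home 1", "home-price" => "10" })
add_value(homes, "home", { "home-name" => "home 2", "home-price" => "20" })

yaml = homes.to_yaml
restored = YAML.load(yaml)   # plain hashes, arrays and strings round-trip
restored == homes            # => true
```

Because only plain hashes, arrays and strings are involved, the file can be loaded back without losing either home.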

The slight problem here is that it is not really possible to distinguish, if you have a single child, whether that child should be an array of one, or a simple value. So when reading, when you want to treat the value as an array, use one of the answers here. For example, if you are not averse to monkeypatching,

hash["home"].ensure_array
# => [{"home-name"=>"home 1", "home-price"=>"10"}, {"home-name"=>"home 2", "home-price"=>"20"}] 
hash["work"].ensure_array
# => [["work-name", "work 1"], ["work-price", "30"]]
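
Note that `ensure_array` is not a built-in method; it has to come from a helper like the ones in the linked answers. The output for `hash["work"]` above is what you get when the helper is built on `Kernel#Array`, which turns a Hash into key/value pairs. If you would rather get the hash back wrapped in a one-element array, a minimal sketch without monkeypatching could look like this (`wrap_in_array` is a hypothetical name):

```ruby
# Hypothetical helper: always returns an Array, without splitting a Hash
# into key/value pairs.
def wrap_in_array(value)
  case value
  when nil   then []
  when Array then value
  else            [value]
  end
end

wrap_in_array([{ "home-name" => "home 1" }, { "home-name" => "home 2" }])
# => [{"home-name"=>"home 1"}, {"home-name"=>"home 2"}]
wrap_in_array({ "work-name" => "work 1", "work-price" => "30" })
# => [{"work-name"=>"work 1", "work-price"=>"30"}]

# For comparison, Kernel#Array converts a Hash to pairs:
Array({ "work-name" => "work 1" })
# => [["work-name", "work 1"]]
```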
Amadan
  • Wow, thanks very much for the write up, reading through your comments I understand the issue with `compare_by_identity`. Id like to give this a go. However I still cannot get it to work, please see the post update – chinds May 09 '17 at 08:10
  • You haven't shown how you're parsing the XML. Your code will get stuck on whitespace text nodes, because instead of ignoring them you `return` out of the function. If you strip those, it works for me: `doc = Nokogiri::XML.parse(xml) do |config| config.noblanks end; map_content(doc.children, {})`. See [How to avoid creating non-significant white space text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object](http://stackoverflow.com/questions/21114933/how-to-avoid-creating-non-significant-white-space-text-nodes-when-creating-a-no) – Amadan May 09 '17 at 09:29

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<document>
  <form xmlns="">
    <title>
      <main-title>Foo</main-title>
    </title>
    <homes>
      <home>
        <home-name>home 1</home-name>
        <home-price>10</home-price>
      </home>
      <home>
        <home-name>home 2</home-name>
        <home-price>20</home-price>
      </home>
    </homes>
  </form>
</document>
EOT

title = doc.at('main-title').text
homes = doc.search('home').map { |home|
  {
    'home' => {
      'home-name' => home.at('home-name').text,
      'home-price' => home.at('home-price').text.to_i
    }
  }
}

hash = {
  'title' => {'main-title' => title},
  'homes' => homes
}

Which, when converted to YAML, results in:

require 'yaml'
puts hash.to_yaml

# >> ---
# >> title:
# >>   main-title: Foo
# >> homes:
# >> - home:
# >>     home-name: home 1
# >>     home-price: 10
# >> - home:
# >>     home-name: home 2
# >>     home-price: 20

You can't create:

homes:
  home:
    home-name: home 1
    home-price: 10
  home:
    home-name: home 2
    home-price: 20

because the home: elements are keys in the homes hash. It's not possible to have multiple keys with the same name; the second will overwrite the first. Instead, they must be an array of hashes, designated as - home as in the above output.

Consider these:

require 'yaml'

foo = YAML.load(<<EOT)
title:
  main-title: Foo
homes:
  home:
    home-name: home 1
    home-price: 10
  home:
    home-name: home 2
    home-price: 20
EOT

foo
# => {"title"=>{"main-title"=>"Foo"},
#     "homes"=>{"home"=>{"home-name"=>"home 2", "home-price"=>20}}}

versus:

foo = YAML.load(<<EOT)
title:
  main-title: Foo
homes:
- home:
    home-name: home 1
    home-price: 10
- home:
    home-name: home 2
    home-price: 20
EOT

foo
# => {"title"=>{"main-title"=>"Foo"},
#     "homes"=>
#      [{"home"=>{"home-name"=>"home 1", "home-price"=>10}},
#       {"home"=>{"home-name"=>"home 2", "home-price"=>20}}]}
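
Reading from the array form takes one extra step compared with a single nested hash. A minimal sketch, assuming the array-of-hashes YAML above (the variable names are illustrative):

```ruby
require 'yaml'

data = YAML.load(<<EOT)
title:
  main-title: Foo
homes:
- home:
    home-name: home 1
    home-price: 10
- home:
    home-name: home 2
    home-price: 20
EOT

# Each array entry is a one-key hash, so unwrap "home" before reading fields.
homes = data["homes"].map { |entry| entry["home"] }

names = homes.map { |home| home["home-name"] }   # => ["home 1", "home 2"]
total = homes.sum { |home| home["home-price"] }  # => 30
```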
the Tin Man