I know it's a pretty noob question once again, but I've been stumbling through the internet for some days now and can't solve my problem. I've downloaded the data dumps from Discogs, an XML file of roughly 35 GB. I've figured out that I will have to use a SAX parser, because I obviously can't load this file into RAM, and that Ox has the best runtime in Ruby. But I simply don't understand how to use this parser: even with small IO objects just for testing, it is still a magical thing throwing things back at me that I don't understand. This is what the XML looks like:

<releases>
<release id="1" status="Accepted"><images><image height="600" type="primary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/></images><artists><artist><id>1</id><name>The Persuader</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><title>Stockholm</title><labels><label catno="SK032" id="5" name="Svek"/></labels><extraartists><artist><id>239</id><name>Jesper Dahlbäck</name><anv></anv><join></join><role>Music By [All Tracks By]</role><tracks></tracks></artist></extraartists><formats><format name="Vinyl" qty="2" text=""><descriptions><description>12"</description><description>33 ⅓ RPM</description></descriptions></format></formats><genres><genre>Electronic</genre></genres><styles><style>Deep House</style></styles><country>Sweden</country><released>1999-03-00</released><notes>The song titles are the names of six of Stockholm's 82 districts.

Title on label: - Stockholm -

Recorded at the Globe Studio, Stockholm

FAX: +46 8 679 64 53

</notes><data_quality>Needs Vote</data_quality><tracklist><track><position>A</position><title>Östermalm</title><duration>4:45</duration></track><track><position>B1</position><title>Vasastaden</title><duration>6:11</duration></track><track><position>B2</position><title>Kungsholmen</title><duration>2:49</duration></track><track><position>C1</position><title>Södermalm</title><duration>5:38</duration></track><track><position>C2</position><title>Norrmalm</title><duration>4:52</duration></track><track><position>D</position><title>Gamla Stan</title><duration>5:16</duration></track></tracklist><identifiers><identifier description="A-Side Runout" type="Matrix / Runout" value="MPO SK 032 A1"/><identifier description="B-Side Runout" type="Matrix / Runout" value="MPO SK 032 B1"/><identifier description="C-Side Runout" type="Matrix / Runout" value="MPO SK 032 C1"/><identifier description="D-Side Runout" type="Matrix / Runout" value="MPO SK 032 D1"/><identifier description="Only On A-Side Runout" type="Matrix / Runout" value="G PHRUPMASTERGENERAL T27 LONDON"/></identifiers><videos><video duration="326" embed="true" src="https://www.youtube.com/watch?v=afMHNll9EVM"><title>The Persuader - Gamla Stan</title><description>The Persuader - Gamla Stan</description></video><video duration="301" embed="true" src="https://www.youtube.com/watch?v=EBBHR3EMN50"><title>The Persuader - Norrmalm</title><description>The Persuader - Norrmalm</description></video><video duration="341" embed="true" src="https://www.youtube.com/watch?v=WDZqiENap_U"><title>The Persuader - Södermalm</title><description>The Persuader - Södermalm</description></video><video duration="176" embed="true" src="https://www.youtube.com/watch?v=XExCZfMCXdo"><title>The Persuader - Kungsholmen</title><description>The Persuader - Kungsholmen</description></video><video duration="376" embed="true" src="https://www.youtube.com/watch?v=Cawyll0pOI4"><title>The Persuader - Vasastaden</title><description>The Persuader - 
Vasastaden</description></video><video duration="296" embed="true" src="https://www.youtube.com/watch?v=MpmbntGDyNE"><title>The Persuader - Östermalm</title><description>The Persuader - Östermalm</description></video></videos><companies><company><id>271046</id><name>The Globe Studios</name><catno></catno><entity_type>23</entity_type><entity_type_name>Recorded At</entity_type_name><resource_url>https://api.discogs.com/labels/271046</resource_url></company><company><id>56025</id><name>MPO</name><catno></catno><entity_type>17</entity_type><entity_type_name>Pressed By</entity_type_name><resource_url>https://api.discogs.com/labels/56025</resource_url></company></companies></release>
<release id="2" status="Accepted"><images><image height="394" type="primary" uri="" uri150="" width="400"/><image height="600" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="600"/></images><artists><artist><id>2</id><name>Mr. James Barth &amp; A.D.</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><title>Knockin' Boots Vol 2 Of 2</title><labels><label catno="SK 026" id="5" name="Svek"/><label catno="SK026" id="5" name="Svek"/></labels><extraartists><artist><id>26</id><name>Alexi Delano</name><anv></anv><join></join><role>Producer, Recorded By</role><tracks></tracks></artist><artist><id>27</id><name>Cari Lekebusch</name><anv></anv><join></join><role>Producer, Recorded By</role><tracks></tracks></artist><artist><id>26</id><name>Alexi Delano</name><anv>A. Delano</anv><join></join><role>Written-By</role><tracks></tracks></artist><artist><id>27</id><name>Cari Lekebusch</name><anv>C. Lekebusch</anv><join></join><role>Written-By</role><tracks></tracks></artist></extraartists><formats><format name="Vinyl" qty="1" text=""><descriptions><description>12"</description><description>33 ⅓ RPM</description></descriptions></format></formats><genres><genre>Electronic</genre></genres><styles><style>Broken Beat</style><style>Techno</style><style>Tech House</style></styles><country>Sweden</country><released>1998-06-00</released><notes>All joints recorded in NYC (Dec.97).</notes><data_quality>Correct</data_quality><master_id is_main_release="true">713738</master_id><tracklist><track><position>A1</position><title>A Sea Apart</title><duration>5:08</duration></track><track><position>A2</position><title>Dutchmaster</title><duration>4:21</duration></track><track><position>B1</position><title>Inner City Lullaby</title><duration>4:22</duration></track><track><position>B2</position><title>Yeah Kid!</title><duration>4:46</duration></track></tracklist><identifiers><identifier description="Side A Runout 
Etching" type="Matrix / Runout" value="MPO SK026-A -J.T.S.-"/><identifier description="Side B Runout Etching" type="Matrix / Runout" value="MPO SK026-B -J.T.S.-"/></identifiers><videos><video duration="268" embed="true" src="https://www.youtube.com/watch?v=LgLchSRehhc"><title>Mr. James Barth &amp; A.D. - Dutchmaster</title><description>Mr. James Barth &amp; A.D. - Dutchmaster</description></video><video duration="297" embed="true" src="https://www.youtube.com/watch?v=x_Os7b-iWKs"><title>Mr. James Barth &amp; A.D. - Yeah Kid!</title><description>Mr. James Barth &amp; A.D. - Yeah Kid!</description></video><video duration="314" embed="true" src="https://www.youtube.com/watch?v=MIgQNVhYILA"><title>Mr. James Barth &amp; A.D. - A Sea Apart</title><description>Mr. James Barth &amp; A.D. - A Sea Apart</description></video><video duration="267" embed="true" src="https://www.youtube.com/watch?v=iaqHaULlqqg"><title>Mr. James Barth &amp; A.D. - Inner City Lullaby</title><description>Mr. James Barth &amp; A.D. - Inner City Lullaby</description></video></videos><companies><company><id>266169</id><name>JTS Studios</name><catno></catno><entity_type>29</entity_type><entity_type_name>Mastered At</entity_type_name><resource_url>https://api.discogs.com/labels/266169</resource_url></company><company><id>56025</id><name>MPO</name><catno></catno><entity_type>17</entity_type><entity_type_name>Pressed By</entity_type_name><resource_url>https://api.discogs.com/labels/56025</resource_url></company></companies></release>
<release id="3" status="Accepted"><images><image height="595" type="primary" uri="" uri150="" width="600"/><image height="472" type="secondary" uri="" uri150="" width="600"/><image height="600" type="secondary" uri="" uri150="" width="599"/><image height="470" type="secondary" uri="" uri150="" width="600"/></images><artists><artist><id>3</id><name>Josh Wink</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><title>Profound Sounds Vol. 1</title><labels><label catno="CK 63628" id="6" name="Ruffhouse Records"/></labels><extraartists><artist><id>3</id><name>Josh Wink</name><anv></anv><join></join><role>DJ Mix</role><tracks></tracks></artist></extraartists><formats><format name="CD" qty="1" text=""><descriptions><description>Compilation</description><description>Mixed</description></descriptions></format></formats><genres><genre>Electronic</genre></genres><styles><style>Techno</style><style>Tech House</style></styles><country>US</country><released>1999-07-13</released><notes>1: Track title is given as "D2" (which is the side of record on the vinyl version of i220-010 release). This was also released on CD where this track is listed on 8th position. On both version no titles are given (only writing/producing credits). Both versions of i220-010 can be seen on the master release page [m27265]. Additionally this track contains female vocals that aren't present on original i220-010 release. &#13;
4: Credited as J. Dahlbäck. &#13;
5: Track title wrongly given as "Vol. 1". &#13;
6: Credited as Gez Varley presents Tony Montana. &#13;
12: Track exclusive to Profound Sounds Vol. 1.</notes><data_quality>Correct</data_quality><master_id is_main_release="false">66526</master_id><tracklist><track><position>1</position><title>Untitled 8</title><duration>7:00</duration><artists><artist><id>5</id><name>Heiko Laux</name><anv></anv><join>&amp;</join><role></role><tracks></tracks></artist><artist><id>4</id><name>Johannes Heil</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>2</position><title>Anjua (Sneaky 3)</title><duration>5:28</duration><artists><artist><id>15525</id><name>Karl Axel Bissler</name><anv>K.A.B.</anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>3</position><title>When The Funk Hits The Fan (Mood II Swing When The Dub Hits The Fan)</title><duration>5:25</duration><artists><artist><id>7</id><name>Sylk 130</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>8</id><name>Mood II Swing</name><anv></anv><join></join><role>Remix</role><tracks></tracks></artist></extraartists></track><track><position>4</position><title>What's The Time, Mr. Templar</title><duration>4:27</duration><artists><artist><id>1</id><name>The Persuader</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>5</position><title>Vol. 
2</title><duration>5:36</duration><artists><artist><id>267132</id><name>Care Company (2)</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>6</position><title>Political Prisoner</title><duration>3:37</duration><artists><artist><id>6981</id><name>Gez Varley</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>7</position><title>Pop Kulture</title><duration>5:03</duration><artists><artist><id>11</id><name>DJ Dozia</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>8</position><title>K-Mart Shopping (Hi-Fi Mix)</title><duration>5:42</duration><artists><artist><id>10702</id><name>Nerio's Dubwork</name><anv></anv><join>Meets</join><role></role><tracks></tracks></artist><artist><id>233190</id><name>Kathy Lee</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>23</id><name>Alex Hi-Fi</name><anv></anv><join></join><role>Remix</role><tracks></tracks></artist></extraartists></track><track><position>9</position><title>Lovelee Dae (Eight Miles High Mix)</title><duration>5:47</duration><artists><artist><id>13</id><name>Blaze</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>14</id><name>Eight Miles High</name><anv></anv><join></join><role>Remix</role><tracks></tracks></artist></extraartists></track><track><position>10</position><title>Sweat</title><duration>6:06</duration><artists><artist><id>67226</id><name>Stacey Pullen</name><anv></anv><join>Presents</join><role></role><tracks></tracks></artist><artist><id>7554</id><name>Black Odyssey</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists><extraartists><artist><id>67226</id><name>Stacey 
Pullen</name><anv></anv><join></join><role>Presenter</role><tracks></tracks></artist></extraartists></track><track><position>11</position><title>Silver</title><duration>3:16</duration><artists><artist><id>3906</id><name>Christian Smith &amp; John Selway</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>12</position><title>Untitled</title><duration>2:46</duration><artists><artist><id>3</id><name>Josh Wink</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>13</position><title>Boom Box</title><duration>3:41</duration><artists><artist><id>19</id><name>Sound Associates</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track><track><position>14</position><title>Track 2</title><duration>3:39</duration><artists><artist><id>20</id><name>Percy X</name><anv></anv><join></join><role></role><tracks></tracks></artist></artists></track></tracklist><identifiers><identifier type="Barcode" value="074646362822"/></identifiers>

I just inserted it as a snippet, which was the easiest way, sorry. What I want to do now is look for specific release IDs, check whether they have a barcode, and get that barcode back if there is one. Could anyone please point me in the right direction? Greetings and thanks in advance, rtuz2th

1 Answer

SAX is "evented" XML parsing. A handler has methods that are called for:

  • entering an element (an opening tag occurs, e.g. <child>)
  • exiting an element (a closing tag occurs, e.g. </child>)
  • attribute found
  • element text/body found

The handler needs to keep track of its current position in the XML and of the values it is interested in, so that it can decide what to do when it encounters the elements it cares about.
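
To make the order of these callbacks concrete, here is a minimal sketch. TraceHandler is a hypothetical class that is not part of Ox, and the events are fed to it by hand rather than by a parser, in the order a SAX parser would be expected to report them for <child id="1"><barcode value="1111"/></child>. It only uses the callback names start_element, attr and end_element.

```ruby
# Hypothetical handler for illustration; it only records the callbacks it receives.
class TraceHandler
  attr_reader :log

  def initialize
    @log = []
    @depth = 0
  end

  # An opening tag was read.
  def start_element(name)
    @log << "#{indent}start #{name}"
    @depth += 1
  end

  # An attribute of the most recently opened element was read.
  def attr(name, value)
    @log << "#{indent}attr #{name}=#{value}"
  end

  # The element was closed.
  def end_element(name)
    @depth -= 1
    @log << "#{indent}end #{name}"
  end

  private

  def indent
    "  " * @depth
  end
end

tracer = TraceHandler.new
# Hand-fed events standing in for a parser (an assumption of this sketch).
tracer.start_element(:child)
tracer.attr(:id, "1")
tracer.start_element(:barcode)
tracer.attr(:value, "1111")
tracer.end_element(:barcode)
tracer.end_element(:child)
puts tracer.log
```

The point to notice is that an attribute never arrives together with its element: it is reported in a separate call right after the opening tag, so the handler has to remember which element it belongs to.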

Your example XML is kind of big, so I made up my own small sample:

xml = <<EOS
<root>
  <child id="1">
    <barcode value="1111"/>
  </child>
  <child id="2">
  </child>
  <child id="1">
    <barcode value="2222"/>
  </child>
  <child id="4">
    <barcode value="3333"/>
  </child>
</root>
EOS

I'm trying to find child elements that have an odd ID and an even barcode value. For this simple example I'm keeping track of all tags and attributes on a stack, discarding the state when exiting an element (@stack.pop). Depending on the depth of your XML document and the number of tags/attributes, this might be too "expensive".

require "ox"
require "stringio"

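class Handler < ::Ox::Sax
  def initialize
    # One [element_name, attributes] pair per element that is currently open.
    @stack = []
  end

  # Called when an opening tag is read: push a fresh entry for the element.
  def start_element(element_name)
    @stack << [element_name, {}]
  end

  # Called when an element is closed: by now all of its attributes are known.
  def end_element(element_name)
    parent_name, parent_attributes = @stack[-2]
    if parent_name == :child && parent_attributes[:id].to_i.odd?
      name, attributes = @stack[-1]
      if name == :barcode && attributes[:value].to_i.even?
        puts "Here is one record that seems interesting: Child: #{parent_attributes[:id]}, Barcode: #{attributes[:value]}"
      end
    end
    @stack.pop
  end

  # Called once per attribute, after start_element for the element it belongs to:
  # store the value on the entry of the element that was opened last.
  def attr(attribute_name, attribute_value)
    # Ignore attributes reported while no element is open
    # (for instance those of an XML declaration, if the parser reports them).
    return if @stack.empty?
    _name, attributes = @stack.last
    attributes[attribute_name] = attribute_value
  end

end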

handler = Handler.new
Ox.sax_parse(handler, StringIO.new(xml))

This will print

Here is one record that seems interesting: Child: 1, Barcode: 2222
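
Applied to the layout in your snippet, where a barcode appears as <identifier type="Barcode" value="..."/> inside <release id="...">, the same idea could look roughly like the sketch below. BarcodeFinder and its constructor argument are hypothetical names, and it keeps only the current release ID and the current identifier's attributes instead of a full stack. The events are again fed by hand so the sketch runs without the dump.

```ruby
require "set"

# Hypothetical handler: collects barcodes for a given set of release IDs.
class BarcodeFinder
  attr_reader :barcodes

  def initialize(wanted_ids)
    @wanted = wanted_ids.map(&:to_s).to_set
    @barcodes = {}      # release id => barcode (the last one wins if there are several)
    @current = nil      # name of the element opened most recently
    @release_id = nil   # id of the release we are currently inside
    @identifier = {}    # attributes of the current <identifier>
  end

  def start_element(name)
    @current = name
    @identifier = {} if name == :identifier
  end

  def attr(name, value)
    if @current == :release && name == :id
      # Only the id of <release> counts; <label> also has an id attribute.
      @release_id = value
    elsif @current == :identifier
      @identifier[name] = value
    end
  end

  def end_element(name)
    if name == :identifier
      if @identifier[:type] == "Barcode" && @wanted.include?(@release_id)
        @barcodes[@release_id] = @identifier[:value]
      end
    elsif name == :release
      @release_id = nil
    end
  end
end

finder = BarcodeFinder.new([3])
# Hand-fed events for a shortened version of release 3 from the question.
events = [
  [:start_element, :release], [:attr, :id, "3"],
  [:start_element, :labels],
  [:start_element, :label], [:attr, :id, "6"], [:end_element, :label],
  [:end_element, :labels],
  [:start_element, :identifiers],
  [:start_element, :identifier],
  [:attr, :type, "Barcode"], [:attr, :value, "074646362822"],
  [:end_element, :identifier],
  [:end_element, :identifiers],
  [:end_element, :release]
]
events.each { |callback, *args| finder.public_send(callback, *args) }
puts finder.barcodes.fetch("3")
```

For the real file, the hand-fed loop would be replaced by passing an opened File (rather than a StringIO) to Ox.sax_parse, as in the example above. I am assuming here that Ox accepts any object that responds to these callbacks; if it does not, let the class inherit from ::Ox::Sax like Handler does.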

Pascal
  • Okay, I will definitely need some time to understand what you've done, thank you so much for your time! What exactly does your attr do? This is a complete mystery to me, sorry. – rtuz2th Mar 19 '18 at 15:35