Oracle 11g: Extract data from XML using xmltable, xmlagg and pivot

Question

I have the following XML:

<record>
    <leader>02220cim a2200325 a 4500</leader>    
    <datafield tag="035" ind1=" " ind2=" ">
      <subfield code="a">P286</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Gold, Claudia M.</subfield>
    </datafield>
    <datafield tag="300" ind1=" " ind2=" ">
      <subfield code="a">1 disque son. (6h47) :</subfield>
      <subfield code="b">numérique ;</subfield>
      <subfield code="c">12 cm.</subfield>
    </datafield>  
    <datafield tag="650" ind1=" " ind2="6">
      <subfield code="a">Émotions chez l&apos;enfant</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="6">
      <subfield code="a">Parents et enfants</subfield>
    </datafield>
    <datafield tag="655" ind1=" " ind2="4">
      <subfield code="a">Enregistrement numérique sur CD.</subfield>
    </datafield>
    <datafield tag="700" ind1="1" ind2=" ">
      <subfield code="a">Taussig, Sylvie</subfield>
    </datafield>
    <datafield tag="700" ind1="1" ind2=" ">
      <subfield code="a">Marcelli, Daniel</subfield>
    </datafield>
    <datafield tag="852" ind1=" " ind2=" ">
      <subfield code="a">GBQ</subfield>
      <subfield code="b">QO</subfield>
      <subfield code="c">DC13378</subfield>
      <subfield code="d">32002518936826</subfield>
    </datafield>
    <datafield tag="852" ind1=" " ind2=" ">
      <subfield code="a">GBQ</subfield>
      <subfield code="b">QO</subfield>
      <subfield code="c">DC13378</subfield>
      <subfield code="d">32002518936859</subfield>
    </datafield>    
  </record>

I need to convert the values in rows into columns. For example:

leader    datafield_035   datafield_100   datafield_300  datafield_650 ...
-----------------------------------------------------------------------
02220...

Here is what I did:

1/ Step1:

insert into table_data(leader, tag, subfield) 
(
select x.leader,x1.tag, x2.subfield
from table_xml t
  cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/collection/record'  
          passing t.xml_document
          columns leader VARCHAR2(100) PATH 'leader',                                        
          datafield      xmltype       path './datafield'  
        ) x
    cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/datafield[@tag=245 or @tag=100 or @tag=300 or @tag=260 or @tag=650 
                or @tag=520 or @tag=091 or @tag=534 or @tag=245 or @tag=500 or @tag=655 or @tag=505]'     
         passing x.datafield
          columns tag VARCHAR2(20) PATH '@tag',           
         subfield xmltype PATH './subfield'
          )x1
      cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/subfield'     
         passing x1.subfield                
         COLUMNS subfield varchar2(2000) PATH '.'
          )x2      
where t.id=3
)
;

2/Step 2:

INSERT INTO table_CLOB (leader, tag, subfield)
select leader, tag, 
rtrim(xmlagg(XMLELEMENT(e,subfield, ' ; ').extract('//text()')order by subfield).getClobVal(),' ; ') as subfields
from table_data
group by leader, tag
;

3/ Step3:

insert into table_result(leader, tag, subfield)
(    select leader, "'091'" as cote, "'100'" as auteur, "'245'" as titre,
        "'300'" as Desc_Mater, "'260'" as edition,"'520'" as resume, "'534'" as version,
        "'500'" as notes, "'650'" as sujet
      from
      (
        select leader, tag, dbms_lob.substr(subfield,2000,1) as col
        from table_clob
      ) 
      pivot(max(col) for tag in ('091', '100', '245', '260','650','520','300','534','500'))
    );

After these steps, I get the result that I expect but the performance is not really good.

Is there anyway I can do one shot and get the final result?

Alex Poole · Answer 1 · 2015-11-11T18:09:38.713

You could use subquery factoring (also known as common table expressions, or CTEs) instead of permanent (or even temporary) tables.

But you don't really need to do all that work, you can use use listagg() and your first query:

select x.leader
  , listagg(case when x1.tag = '091' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as cote
  , listagg(case when x1.tag = '100' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as auteur
  , listagg(case when x1.tag = '245' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as titre
  , listagg(case when x1.tag = '300' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as desc_mater
  , listagg(case when x1.tag = '260' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as edition
  , listagg(case when x1.tag = '520' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as resume
  , listagg(case when x1.tag = '534' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as version
  , listagg(case when x1.tag = '500' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as notes
  , listagg(case when x1.tag = '650' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as subjet
from table_xml t
  cross join XMLTABLE ( ... ) x
    cross join XMLTABLE ( ... )x1
      cross join XMLTABLE ( ... )x2      
where t.id=3
group by x.leader;

Which gets the same result form your sample, apart from the apostrophe losing it's entity encoding; which may be a good thing.

SQL Fiddle demo; I was going to put the CTE version on there for reference but it's killing it - it often isn't happy with XML or CTEs, and it's intermittently struggling with the listagg version too; so here's the full version of both:

with table_data as (
select x.leader,x1.tag, x2.subfield
from table_xml t
  cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/collection/record'  
          passing t.xml_document
          columns leader VARCHAR2(100) PATH 'leader',                                        
          datafield      xmltype       path './datafield'  
        ) x
    cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/datafield[@tag=245 or @tag=100 or @tag=300 or @tag=260 or @tag=650 
                or @tag=520 or @tag=091 or @tag=534 or @tag=245 or @tag=500 or @tag=655 or @tag=505]'     
         passing x.datafield
          columns tag VARCHAR2(20) PATH '@tag',           
         subfield xmltype PATH './subfield'
          )x1
      cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/subfield'     
         passing x1.subfield                
         COLUMNS subfield varchar2(2000) PATH '.'
          )x2      
where t.id=3
),
table_clob as (
select leader, tag, 
rtrim(xmlagg(XMLELEMENT(e,subfield, ' ; ').extract('//text()')order by subfield).getClobVal(),' ; ') as subfield
from table_data
group by leader, tag
),
table_result as (
select leader, "'091'" as cote, "'100'" as auteur, "'245'" as titre,
        "'300'" as Desc_Mater, "'260'" as edition,"'520'" as resume, "'534'" as version,
        "'500'" as notes, "'650'" as sujet
      from
      (
        select leader, tag, dbms_lob.substr(subfield,2000,1) as col
        from table_clob
      ) 
      pivot(max(col) for tag in ('091', '100', '245', '260','650','520','300','534','500'))
)
select * from table_result;

select x.leader
  , listagg(case when x1.tag = '091' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as cote
  , listagg(case when x1.tag = '100' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as auteur
  , listagg(case when x1.tag = '245' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as titre
  , listagg(case when x1.tag = '300' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as desc_mater
  , listagg(case when x1.tag = '260' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as edition
  , listagg(case when x1.tag = '520' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as resume
  , listagg(case when x1.tag = '534' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as version
  , listagg(case when x1.tag = '500' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as notes
  , listagg(case when x1.tag = '650' then x2.subfield end, ' ; ')
      within group (order by x2.subfield) as subjet
from table_xml t
  cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/collection/record'  
          passing t.xml_document
          columns leader VARCHAR2(100) PATH 'leader',                                        
          datafield      xmltype       path './datafield'  
        ) x
    cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/datafield[@tag=245 or @tag=100 or @tag=300 or @tag=260 or @tag=650 
                or @tag=520 or @tag=091 or @tag=534 or @tag=245 or @tag=500 or @tag=655 or @tag=505]'     
         passing x.datafield
          columns tag VARCHAR2(20) PATH '@tag',           
         subfield xmltype PATH './subfield'
          )x1
      cross join XMLTABLE (xmlnamespaces (default 'http://www.loc.gov/MARC21/slim')  
        , '/subfield'     
         passing x1.subfield                
         COLUMNS subfield varchar2(2000) PATH '.'
          )x2      
where t.id=3
group by x.leader;

Hi @Alex: thanks but I dont know why I am getting errors : "SQL error 17410 No more data to read from socket" ? — giluy, Nov 11 '15 at 18:38
@giluy - with both approaches? That usually means the DB session crashed - might have something in the server alert log. I ran this in 11.2.0.4; perhaps your version is hitting a bug. Which would be awkward. (Maybe that's what's killing SQL Fiddle, which is on 11.2.0.2). What version gets that error? — Alex Poole, Nov 11 '15 at 18:45
HI @Alex: Mine is Oracle Database 11g Enterprise Edition 11.2.0.4.0 It gives me that error when I put "Create table test_result as (...) ; ". But, in general, both approaches consume a lot of time. I can say it is over 10mins because the XML file is huge, and I use xmlagg instead of listadd (because some fileds are over 4000 characters). — giluy, Nov 11 '15 at 18:53
@giluy - yes, I get that with CTAS too. From the ORA-07445 log looks like a known bug; see MOS document 1943615.1. There may be a workaround, but unless CTAS is vital, creating it first and then doing an insert seems to work OK. If `listagg()` is ruled out because you have values that are too long, the CTE version may still be a little faster than your original since you aren't doing the intermediate inserts; or you could create your own CLOB aggregator. But the size of your document may be dwarfing any savings anyway; I'd test the approaches with a subset to see how much difference they make. — Alex Poole, Nov 11 '15 at 19:23

Oracle 11g: Extract data from XML using xmltable, xmlagg and pivot

1 Answers1