0

How do we work with large XML schema(XSD)/XML messages on GreenPlum?

I reviwed the GreenPlum DBA guide. It talks about processing XML messages using STX tranformation.

If we have an XSD and we need to process the XML messages that was compliant with XSD.

How do we approach this situation in GreenPlum?

What are the best approaches?

Are there any other approaches to process XSD based /XML in Greenplum?

Thanks for your insights.

1 Answers1

1

In general, there's no built-in functionality for processing XML files in Greenplum. The thing you are talking about is the gpfdist client-side transformation that allows you to convert your XML to CSV and then process it inside of the database.

The best option in processing the XMLs inside of the Greenplum would be to decompose them to the relational model and load as a set of tables. If you need to process just a general XML files, you can write a function in PL/Java or PL/Python that would parse XML for you and extract the information you need.

Here's an example of the function:

create or replace function parse (inxml varchar, node varchar) returns varchar as $BODY$
from xml.etree import ElementTree
def getText (node, subnode):
    res = None
    if node is not None:
        sn = node.find(subnode)
        if sn is not None:
            res = sn.text
        else:
            for el in node:
                res2 = getText(el, subnode)
                if res2 is not None:
                    res = res2
                    break
    return res
et = ElementTree.fromstring(inxml.strip())
return getText(et, node)
$BODY$
language plpythonu
volatile;

And here's an example of its invocation:

select parse (
'<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Dont forget me this weekend!</body>
<attachments>
<attachment>file1</attachment>
<attachment>file2</attachment>
<attachment>file3</attachment>
</attachments>
</note>',
'attachment');
0x0FFF
  • 4,948
  • 3
  • 20
  • 26
  • Thanks for your input. Your sample function takes the XML message as function input. It is OK for small XML message. We have number of large XML messages as text files (.xml) in linux server. How do we parse those using your function? Is there a mechanism to pass the linux file system file as input to your parse function? I know only gpfdist read text files (non XML files). – user2647763 - RIMD Nov 24 '14 at 02:13
  • If you are operating with 5-10KB XML files you can store them as a text in database and parse on demand. If you are working with 10KB-10MB XML files you'd better write ETL that would translate the XML files into the relational model and then load it into database. If you have 10MB+ XML files it might be reasonable to use HDFS + MapReduce or Spark to parse them in a distributed fashion and store deep history. The function I wrote is just a python - write it as a Python function and use to parse specific file, it is easy to do. The same python function can be used on Spark – 0x0FFF Nov 24 '14 at 07:52
  • Thanks. "The function I wrote is just a python - write it as a Python function and use to parse specific file" How this function differ from posted (above) one? – user2647763 - RIMD Jun 13 '20 at 02:23