
I'm working with several huge (>2 GB) XML files, and their size is causing problems.

(For example, I'm using XMLReader in a PHP script to parse smaller ~500 MB files, and that works fine, but 32-bit PHP can't open files this large.)

So my idea is to eliminate big chunks of the file that I know I don't need.

For example, if the structure of the file looks like this:

<record id="1">
    <a>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
...
<record id="999999">
    <a>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>

For my purposes, I only need the data in the <a> element of each record. If I could remove the <b> and <c> elements from every record, I could reduce the size of the file substantially, so it would be small enough to work with normally.

What's the best way to do something like this (hopefully with something like sed or grep or a free/cheap application)?

I've tried a trial version of Altova XMLSpy, and it won't even open the XML file (I assume because the file is too large).

mattstuehler
  • You want a SAX parser [like `XmlReader`](http://php.net/manual/en/book.xmlreader.php) instead of a DOM parser. – Tomalak Sep 18 '14 at 15:46
  • I do believe you are using such large XML. If so you are using the technology – Ed Heal Sep 18 '14 at 15:47
  • @Tomalak - thanks for your comment, but I can't use XMLReader - it won't open the file. I'm looking for a utility that will remove nodes I know I don't need, so that I can reduce a 2.5gb file to <1gb, so that I ***can*** then use XMLReader. – mattstuehler Sep 18 '14 at 16:18
  • What error does XMLReader give? Alternatively, [`XMLParser`](http://php.net/manual/en/book.xml.php) is worth a try. I'd really urge you to reconsider your proposal of using a non-XML-aware utility like `sed` or `awk` to cut down the size of your file. There are parsers that are made to handle these file sizes, you should use such a tool. – Tomalak Sep 18 '14 at 16:26
  • @Tomalak, XMLReader gives an error when I use `open()`. Here's a question I posted about that: http://stackoverflow.com/questions/25916589/xmlreader-cant-open-large-2gb-xml-files I also tried XMLParser, but I run into the same problem. I think the issue is that 32-bit PHP can't handle files this large, even with pull- or event-based XML parsers. – mattstuehler Sep 18 '14 at 16:47
  • If you open the document in nonvalidating mode, a SAX parser *should* be able to handle this, since it's just streaming the data through and maintaining a very small amount of data to check well-formedness. My go-to parser remains Apache Xerces, and I'd suggest trying their basic SAX example that just reads the XML content and writes it back out as fast as it comes in; if that works (and it should!), you can then modify it to add the filtering logic. – keshlam Sep 18 '14 at 19:51
  • Of course if the problem is that this document isn't well-formed XML... well, you'll have to fix that before any XML tool will handle it cleanly. – keshlam Sep 18 '14 at 19:52

1 Answer

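Since you mention sed and grep, I assume you are on Linux.

If you have the xsltproc utility, an XSLT stylesheet can do this.

Given a corrected version of your test file (wrapped in a single root element so that it is well-formed), saved as record.xml: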

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="project.xsl" type="text/xsl"?>

<records>
<record id="1">
    <a>
        <detail>hello</detail>
        bar
        <detail>world</detail>
    </a>
    <b>
        <detail>blah</detail>
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
<record id="999999">
    <a>
        <detail>blah</detail>
        foo
        <detail>blah blah</detail>
    </a>
    <b>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </b>
    <c>
        <detail>blah</detail>
        ....
        <detail>blah</detail>
    </c>
</record>
</records>

and the corresponding XSL stylesheet (extract.xsl):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">


<xsl:output method="xml"  />
<xsl:template match="records">
<xsl:element name="records">

<xsl:for-each select="record">
<xsl:element name="record">
<xsl:attribute name="id"><xsl:value-of select="@id" /></xsl:attribute>
<xsl:copy-of select="./a" />
</xsl:element>

</xsl:for-each>

</xsl:element>

</xsl:template>
</xsl:stylesheet>

The result of

xsltproc extract.xsl  record.xml

would be

<?xml version="1.0"?>
<records><record id="1"><a>
        <detail>hello</detail>
        bar
        <detail>world</detail>
    </a></record><record id="999999"><a>
        <detail>blah</detail>
        foo
        <detail>blah blah</detail>
    </a></record></records>

Is this close to what you expect?
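
One caveat: xsltproc builds the whole input tree in memory before applying the stylesheet, so it may struggle with a multi-gigabyte file. As suggested in the comments, a streaming (SAX) filter avoids that by copying events to the output as they arrive. Below is a minimal sketch of that idea in Python, using only the standard library. The names `KeepOnly` and `filter_xml` are made up for illustration, and it assumes the structure shown in the question: `<record>` elements whose only wanted child is `<a>`. Whether it can open files over 2 GB depends on your Python build and OS, which I have not tested here.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class KeepOnly(xml.sax.ContentHandler):
    """Forward SAX events to `out`, dropping unwanted children of <record>."""

    def __init__(self, out, parent="record", keep=("a",)):
        super().__init__()
        self.out = out            # an XMLGenerator that writes the result
        self.parent = parent      # element whose children are filtered
        self.keep = set(keep)     # child names to keep under `parent`
        self.stack = []           # names of the currently open elements
        self.skip_depth = 0       # > 0 while inside a dropped subtree

    def startDocument(self):
        self.out.startDocument()

    def endDocument(self):
        self.out.endDocument()

    def startElement(self, name, attrs):
        if self.skip_depth:
            # Already inside a dropped subtree: just track nesting.
            self.skip_depth += 1
        elif (self.stack and self.stack[-1] == self.parent
              and name not in self.keep):
            # Direct child of <record> that is not wanted: start skipping.
            self.skip_depth = 1
        else:
            self.out.startElement(name, attrs)
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.endElement(name)

    def characters(self, content):
        if not self.skip_depth:
            self.out.characters(content)


def filter_xml(source, sink):
    """Stream `source` (path or binary file object) into `sink`, keeping only <a>."""
    writer = XMLGenerator(sink, encoding="utf-8")
    xml.sax.parse(source, KeepOnly(writer))


# Small in-memory demo; for a real file, pass a path as `source`
# and an open output file as `sink`.
sample = io.BytesIO(
    b'<records><record id="1">'
    b"<a><detail>hello</detail></a>"
    b"<b><detail>drop me</detail></b>"
    b"<c><detail>drop me</detail></c>"
    b"</record></records>"
)
demo_out = io.StringIO()
filter_xml(sample, demo_out)
print(demo_out.getvalue())
```

For a real file you would call something like `filter_xml("big.xml", open("small.xml", "w", encoding="utf-8"))` (both file names are placeholders). Memory use stays roughly constant because only the stack of open element names is kept, not the document. Whitespace between the dropped elements is still copied through, so the output will contain some blank lines.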

Archemar