How to parse simple xml file

Question

>> ? xml
No information on xml

There's parse-xml but it seems to me that it was for Rebol2.

I searched rebol.org for XML scripts and found xml-object.r, which seemed to be the most up to date of the results.

I know about altxml, too, but the examples given are for html.

So I'd like to ask what my options are if I want to parse and use the information in more than 1GB of files with this simplified structure:

<?xml version="1.0" encoding="Windows-1252" standalone="yes"?>
<SalesFile xmlns="urn:StandardSalesFile-1.0">
    <Header>
        <SalesFileVersion>1.01</SalesFileVersion>
        <DateCreation>2014-04-30</DateCreation>
    </Header>
    <SalesInvoices>
        <Invoice>
            <InvoiceNo>INV 1/1</InvoiceNo>
            <DocumentStatus>
                <InvoiceStatus>N</InvoiceStatus>
                <InvoiceStatusDate>2014-01-03T17:57:59</InvoiceStatusDate>
            </DocumentStatus>
        </Invoice>
        <Invoice>
            <InvoiceNo>INV 2/1</InvoiceNo>
            <DocumentStatus>
                <InvoiceStatus>N</InvoiceStatus>
                <InvoiceStatusDate>2014-01-03T17:59:12</InvoiceStatusDate>
            </DocumentStatus>
        </Invoice>
    </SalesInvoices>
</SalesFile>

Is Rebol3 going to have a parse-xml tool? Should I use xml-object? If so, how? It's still beyond my novice level in the language. Are there other options?

HostileFork says dont trust SE · Answer 1 · 2014-05-30T10:51:09.310

Do you really need to deal with the XML file as structure? If not, have you considered just using PARSE?

(Warning: the following is untested, I'm just presenting the concept.)

Invoices: copy []

parse my-doc [
    <?xml version="1.0" encoding="Windows-1252" standalone="yes"?>

    thru <SalesFile xmlns="urn:StandardSalesFile-1.0">

    thru <Header>
        thru <SalesFileVersion> copy SalesFileVersion to </SalesFileVersion> 
        </SalesFileVersion>

        thru <DateCreation> copy DateCreation to </DateCreation>
        </DateCreation>
    thru </Header>

    thru <SalesInvoices>

    any [
        thru <Invoice>

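        ; declare the fields up front so the path assignments below can set them
        (Invoice: object [No: Status: StatusDate: none])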

        thru <InvoiceNo> copy InvoiceNo to </InvoiceNo>
        </InvoiceNo>

        (Invoice/No: InvoiceNo)

        thru <DocumentStatus>
            thru <InvoiceStatus> copy InvoiceStatus to </InvoiceStatus>
            </InvoiceStatus>

            (Invoice/Status: InvoiceStatus)

            thru <InvoiceStatusDate> copy InvoiceStatusDate to </InvoiceStatusDate>
            </InvoiceStatusDate>

            (Invoice/StatusDate: InvoiceStatusDate)

        thru </DocumentStatus>

        thru </Invoice>

        ; collect the finished invoice
        (append Invoices Invoice)
    ]

    thru </SalesInvoices>

    thru </SalesFile>

    to end
]

If you know you have well-formed XML and don't want a dependency on a library for processing clunky-ol' XML, Rebol can get pretty far and clear with PARSE. As TAG! is just a subclass of string, you can make things look relatively literate. And it's much more lightweight to just work with the strings.
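
As a small, self-contained illustration of the `copy x to </tag>` pattern, here is a sketch that runs PARSE over a hard-coded string instead of a file. The sample string and the words `sample`, `numbers`, and `num` are made up for the example, and it assumes Rebol 3, where a TAG! in a string rule matches its text including the angle brackets.

```rebol
; Hypothetical in-memory sample; a real document would come from READ.
sample: {<Invoice><InvoiceNo>INV 1/1</InvoiceNo></Invoice><Invoice><InvoiceNo>INV 2/1</InvoiceNo></Invoice>}

numbers: copy []

parse sample [
    any [
        ; jump past the next opening tag, then capture up to the closing tag
        thru <InvoiceNo> copy num to </InvoiceNo>
        </InvoiceNo>
        (append numbers num)
    ]
    to end
]

; show the collected invoice numbers
probe numbers
```

Only the text between the `InvoiceNo` tags is copied here; everything else is skipped over with THRU rather than stored.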

Though if structural manipulations are required, you'll need something that makes a DOM. Altxml is the go-to right now, AFAIK.

(Hmm...I had a name for the pattern copy x to <foo> <foo> that escapes me at the moment, but this is a good case for it.)

You're looking for `parse [ something ] [ copy until ]` but your proposal hasn't been implemented yet — HappySpoon, May 30 '14 at 07:38
@HostileFork: I tried a simplified version of this script with a 1.5GB xml file. Basically, its `rule` is `thru ` any [thru (invoice_count: invoice_count + 1 print [invoice_count])`. The problem is that the `` is very far from the beginning. And I get an `Internal error: not enough memory`. Is there any possibility of a command like `discard thru`, so that those characters are sent to `/dev/null` or something and not archived in some `C` array? — Luis, Jun 06 '14 at 09:26
@Luis Did you actually successfully load the string? I can't get a string of just "a" of length 1610612736. :-/ You should be able to seek THRU or TO any pattern in a string without incurring cost proportional to the input skipped *(as long as you are not doing COPY X THRU or COPY Y TO in order to save it as you go)*. You may just have too big a file to be expecting success as a string fully loaded in memory at this time. Perhaps you could have a preprocessing step that runs line by line on the I/O to generate an intermediate file with abbreviations `~~` for ``, etc.~~ — HostileFork says dont trust SE, Jun 06 '14 at 19:27

score 2 · Answer 2 · answered May 30 '14 at 00:14

There is also a Rebol 3 library by Christopher Ross-Gill called alt-xml.

http://www.ross-gill.com/page/XML_and_REBOL

This can translate the XML to either a block! or object! representation.

Your question states that these XML files are large and may not fit in main memory. I would suggest that creating 1GB XML files is not best practice as many parsers, including this one, do attempt to load the files into memory.

To support this you will have to chunk the files yourself by using open on the file and copy/part chunks out of the file. This is a bit messy, but it will work.

One way to make this cleaner is to use parse as per HostileFork's suggestion and modify the series as you parse it. Parse is very flexible in this regard.

Ideally parse would be able to work directly on port! objects, but this is only a future wish list item at the moment.

The size of the XML files doesn't depend on me. Some of them are dozens or hundreds of GB, so this library won't be useful for such files, only for smaller ones. I'll check how large a file my system can handle. — Luis, May 30 '14 at 13:37

score 1 · Answer 3 · answered Apr 05 '21 at 20:45

Another option is %Rebol-Dom.r or %rebol-dom-mdlparser.r. If you're willing to use Rebol 2, you could use PARSE to seek thru to the node name and copy a chunk of data, feed that chunk to Rebol-Dom.r's getnodename "salesInvoice", and append the resulting node element to a block, repeating this for each node.
