I'm having a tough time finding a Node package that can parse large XML files that are 1 GB+ in size. Our back-end server is primarily Node.js, so I'd hate to have to build another service in another language/platform just to parse the XML and write data to a database. Has anyone had success doing this kind of thing in Node? What did you use? I've looked at a bunch of packages like xml-stream, big-xml, etc., and they all have their own problems. Some can't even compile on Mac (and seem outdated and no longer supported). I don't really need to convert the parsed results into JS objects or anything like that. I just need to make sense of the data and then write it to a database.
- Yeah I'm also looking for something sensible to use with my [scramjet framework](https://www.npmjs.com/package/scramjet) - this could be something you may want to use in a later step, but it should be fed with something like a "sax" processor... – Michał Karpacki Sep 14 '18 at 12:42
- Have you checked this? https://github.com/isaacs/sax-js – Michał Karpacki Sep 14 '18 at 12:42
- @MichałKarpacki yeah, I tried sax, but it seems so darn slow and cumbersome to use. – u84six Sep 14 '18 at 15:41
- Hmm... strange. As far as I can remember, sax was actually faster than libxml. I don't have time now, but I'll check, try to couple some samples with scramjet as I wanted, and post my findings on the parsers here... – Michał Karpacki Sep 14 '18 at 15:49
1 Answer
The most obvious, but not very helpful, answer is that it depends on the requirements.
In your case, however, it seems pretty straightforward: you need to load large chunks of data that may or may not fit into memory, do some simple processing, and write the results to a database. That alone is a good reason to externalise the CPU-heavy work into a separate process. So it would probably make more sense to first focus on which XML parser does the job for you, rather than on which Node wrapper you want to use for it.
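As a rough sketch of that split (the worker script name, its arguments, and the message shape are made up here purely for illustration):

import { fork } from 'child_process'

// Parse in a separate Node process so the main server's event loop
// stays responsive while the 1 GB+ file is being chewed through
const worker = fork('./xml-parse-worker.js', ['/path/to/large.xml'])

worker.on('message', (record) => {
  // each parsed record the worker sends back via process.send(...)
  // hand it off to your database layer here
})

worker.on('exit', (code) => {
  console.log(`XML worker exited with code ${code}`)
})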
Obviously, any parser that requires the entire document to be loaded into memory before processing is not a valid option. You will need to use streams for this, together with parsers that support that kind of sequential processing.
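To make that sequential style concrete, here is a minimal sketch using the sax package mentioned in the comments; the file path and the Person/name element structure are only placeholder assumptions:

import { createReadStream } from 'fs'
import * as sax from 'sax'

// strict = true keeps tag names exactly as they appear in the document
const saxStream = sax.createStream(true)

let currentTag = ''
let person = {}

saxStream.on('opentag', (node) => {
  currentTag = node.name
  if (node.name === 'Person') person = {}
})
saxStream.on('text', (text) => {
  // character data can arrive in chunks, so append rather than assign
  if (currentTag && currentTag !== 'Person') {
    person[currentTag] = (person[currentTag] || '') + text.trim()
  }
})
saxStream.on('closetag', (name) => {
  currentTag = ''
  if (name === 'Person') {
    // one complete record is available here, without the whole file in memory
    console.log(person)
  }
})
saxStream.on('error', (err) => console.error(err))

createReadStream('/path/to/large.xml').pipe(saxStream)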
This leaves you with a few options: Saxon, Libxml, and Expat.
Saxon seems to have the highest level of conformance to the recent W3C specs, so if schema validation and the like is important, then it might be a good candidate. Otherwise, both Libxml and Expat seem to stack up pretty well performance-wise, and they come preinstalled on most operating systems.
There are Node wrappers available for all of these:
- libxmljs – Libxml
- xml-stream – Expat
- node-expat – Expat
- saxon-node – Saxon
My Node implementation would look something like this:
import * as XmlStream from 'xml-stream'
import { get } from 'http'
import { createWriteStream } from 'fs'

const databaseWriteStream = createWriteStream('/path/to/file.csv')

// http.get hands us the response, which is a readable stream of the XML body
get('http://external.path/to/xml', (xmlFileReadStream) => {
  const xmlParser = new XmlStream(xmlFileReadStream)
  // fires once per closing </Person> tag, with the collected child values
  xmlParser.on('endElement: Person', ({ name, phone, age }) =>
    databaseWriteStream.write(`"${name}","${phone}","${age}"\n`))
  xmlParser.on('end', () => databaseWriteStream.end())
})
Of course I have no idea what your database write stream would look like, so here I am just writing it to a file.
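If the target were, say, a document store or a SQL database, one option is an object-mode Writable that batches inserts. This is only a sketch: dbClient.insertMany is a stand-in for whatever promise-returning call your actual driver exposes, and the batch size is arbitrary.

import { Writable } from 'stream'

const createDatabaseWriteStream = (dbClient, batchSize = 500) => {
  let batch = []
  // push the accumulated rows to the database, then signal completion
  const flush = (callback) => {
    if (batch.length === 0) return callback()
    const rows = batch
    batch = []
    dbClient.insertMany(rows).then(() => callback(), callback)
  }
  return new Writable({
    objectMode: true,
    write(record, _encoding, callback) {
      batch.push(record)
      if (batch.length >= batchSize) return flush(callback)
      callback()
    },
    final(callback) {
      flush(callback)
    }
  })
}

With something like this in place, the endElement handler above could call databaseWriteStream.write({ name, phone, age }) instead of formatting CSV lines.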

- hey :) if you didn't know the 'Person' node existed, how would you parse this XML? – Roey Zada Feb 03 '20 at 05:01