
I have to process a large XML file (around 25 MB in size) and organize the data into documents to import into MongoDB.

The issue is, there are around 5-6 types of elements in the xml document, each with around 10k rows.

After fetching one XML node of type a, I have to fetch its corresponding elements of types b, c, d, etc.

What I am trying to do in node:

  1. Fetch all the rows of type a.
  2. For each row, using xpath, find its corresponding related rows, and create the document.
  3. Insert document in mongodb

If there are 10k rows of type a, the 2nd step runs 10k times. I am trying to get this to run in parallel so that the thing doesn't take forever. Hence, async.forEach seemed to be the perfect solution.

async.forEach(rowsA,fetchA);

My fetchA function looks roughly like this:

var fetchA = function(rowA) {
    // convert the xml row into an object
    var obj = {};
    for (var i in rowA.attributes) {
        var attribute = rowA.attributes[i];
        if (attribute.value === undefined)
            continue;
        obj[attribute.name] = attribute.value;
    }
    console.log(obj.someattribute);
    // find other related rows,
    // callback inserts the modified object with the subdocuments
    findRelations(obj, function(obj) {
        insertA(obj, postInsert);
    });
};

When I run this, the console.log in the code only fires about once every 1.5 seconds, not in parallel for every row as I expected. I have been trying to figure this out for the past two hours, but I am not sure what I am doing wrong.

I am not very adept with node, so please be patient.

Munim
  • Is using Node a hard requirement? Your toolkit is totally your call (and I think Node is great) but processing large XML files isn't its core use case so the problem will likely be harder than it need be. For example, there are much more mature tools for this on the JVM: https://gist.github.com/1876390 – Richard Marr Jan 04 '13 at 15:13
  • @RichardMarr I am not very familiar with Groovy, but I will keep your point in mind. I have still not completely decided on the implementation, and we are pretty platform agnostic. I was giving node a shot, because it is one of the popular platforms used in my organisation. – Munim Jan 04 '13 at 18:53
  • The use of Groovy is entirely optional; the code just needs a few tweaks to make it valid Java. Any other JVM language would be fine too. The main point to take away is that the libraries for processing large amounts of streaming XML on the JVM matured years ago, so you'll have a better likelihood of performance, stability, documentation, etc. – Richard Marr Jan 06 '13 at 12:18

1 Answer


It looks to me like you're not declaring and calling the callback function which async will pass to your iterator function (fetchA). See the forEach documentation for an example.

Your code probably needs to look more like...

var fetchA = function(rowA, cb) {
    // convert the xml row into an object
    var obj = {};
    for (var i in rowA.attributes) {
        var attribute = rowA.attributes[i];
        if (attribute.value === undefined)
            continue;  // skip this attribute; cb must be called exactly once per row, not here
        obj[attribute.name] = attribute.value;
    }
    console.log(obj.someattribute);
    // find other related rows,
    // callback inserts the modified object with the subdocuments
    findRelations(obj, function(obj) {
        insertA(obj, postInsert);
        cb();  // You may even need to call this within insertA or postInsert if those are asynchronous functions.
    });
};
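
To make the contract concrete, here is a minimal, self-contained sketch of it. Note that `forEachParallel`, `fakeInsert` and `fetchRow` are hypothetical names made up for this illustration: `forEachParallel` is a simplified stand-in for the idea, not the async library's actual implementation, and `fakeInsert` only simulates an asynchronous insert with a timer.

```javascript
// Hypothetical stand-in for the iterator/callback contract (NOT the async library's code).
// It starts the iterator for every item, then calls `done` once all items have reported back.
function forEachParallel(items, iterator, done) {
  var remaining = items.length;
  var finished = false;
  if (remaining === 0) { return done(null); }
  items.forEach(function (item) {
    iterator(item, function (err) {
      if (finished) { return; }            // ignore callbacks after completion or error
      if (err) { finished = true; return done(err); }
      remaining -= 1;
      if (remaining === 0) { finished = true; done(null); }
    });
  });
}

// Hypothetical asynchronous insert, simulated with a 10 ms timer.
function fakeInsert(row, cb) {
  setTimeout(function () { cb(null); }, 10);
}

var log = [];

// The iterator receives the item AND a callback it must call exactly once.
function fetchRow(row, cb) {
  log.push('start ' + row);
  fakeInsert(row, function (err) {
    log.push('done ' + row);
    cb(err); // signal completion only after the insert has finished
  });
}

forEachParallel(['a1', 'a2', 'a3'], fetchRow, function (err) {
  log.push('all done');
  console.log(log.join(', '));
});
```

If `fetchRow` never called `cb`, the final function would never run, which is the situation the original `fetchA` is in.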
tomato
  • I thought that the callback function was optional, and only required if you need to run some code after all the iterators have completed execution. But you may be right. I have to test it out and get back to you. – Munim Jan 04 '13 at 18:57
  • You're thinking of the optional callback you can declare _after_ the function you want to run on each item. Async passes a separate callback into your iterator function which you have to call to tell async you're done. This is a pretty common idiom in Node. – tomato Jan 04 '13 at 22:58
  • I am trying to figure out the best place to add it, to make sure the code gets executed asynchronously and as fast as possible. I have refactored findRelations so that it works like `findRelations(obj,insertA)`. Now, I can call the callback either before or after this. I've been experimenting with both, but it still doesn't seem to be executing asynchronously. It is still slow, processing about one record per second. – Munim Jan 07 '13 at 05:46
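
A possible reason the callback's position makes no difference, sketched under an assumption: the thread never shows the body of findRelations, but if it runs its XPath queries synchronously, each row blocks Node's single JavaScript thread until it finishes, so no other row can start in the meantime. In the sketch below, `slowSyncLookup` is a hypothetical stand-in for such a synchronous lookup.

```javascript
// Hypothetical stand-in for a synchronous, CPU-bound lookup (e.g. an XPath query).
function slowSyncLookup(id) {
  var end = Date.now() + 5;
  while (Date.now() < end) {} // busy-wait: nothing else can run during this loop
  return 'relations-for-' + id;
}

var order = [];

// Even when every row is started "at once", synchronous work runs one row at a time.
[1, 2, 3].forEach(function (id) {
  order.push('start ' + id);
  slowSyncLookup(id);
  order.push('end ' + id);
});

console.log(order.join(', '));
```

Each row finishes before the next one starts. Under that assumption, only the genuinely asynchronous parts, such as the MongoDB insert, can overlap, and speeding things up would mean making the lookup itself cheaper.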