
I have a single directory on a Windows machine containing 3.5 million JSON files, each ranging from 3 to 30 KB. I have a function:

function myBuilder(json) {
    // some stuff producing an object named entry
    return entry;
}

All I want to do is read every file in the directory, run each one through myBuilder, and insert the resulting entries into a MongoDB database. I've posted my best attempt below.

What is the simplest way to achieve the desired result?

Notes:

  1. I suspect insertMany may be complicated here, because I would need a way to break the operation into chunks: a single array holding all the entries would exceed my available RAM.
  2. I can't seem to get glob to work. Could it be a Windows-based limitation? Could it be a memory-based limitation? Either way, for now I would like to avoid answers that use glob.
  3. I would really appreciate someone explaining whether it makes more sense to run many consecutive insertOne operations inside a single database connection, or whether it is necessary to connect and disconnect for each insert.

SAMPLE CODE:

var fs = require('fs');
var mongodb = require('mongodb');
var MongoClient = mongodb.MongoClient;
var MongoURL = 'mongodb://localhost:27017/my_database_name';

traverseFileSystem('/nodejs/nodetest1/imports');

function traverseFileSystem(path) {
    var files = fs.readdirSync(path);
    for (var i in files) {
        var currentFile = path + '/' + files[i];
        var stats = fs.statSync(currentFile);
        if (stats.isFile()) {
            var fileText = fs.readFileSync(currentFile, 'utf8');
            var json = JSON.parse(fileText);
            var entry = myBuilder(json);    // myBuilder is described above
            insertToMongo(entry);
        }
    }
}

function insertToMongo(entry) {
    console.log(entry);
    // opens a new connection for every single entry
    MongoClient.connect(MongoURL, function (err, db) {
        var collection = db.collection('users');
        collection.insert(entry, function (err, result) {
            if (err)
                console.log("error was " + err);
            else
                console.log("entry was " + result);
            db.close();
        });
    });
}

This runs and logs a well-formatted entry to the console for every file in the directory, but it never logs either an error or a result for any entry. Mongo shows that a connection is made and does not report any errors.

COMisHARD
  • What's the actual problem with inserting them one by one? – Alex Blex Dec 21 '16 at 14:51
  • I'm perfectly fine with a series of insertOne operations. My problem is that I can't seem to get this to actually work. Could it have something to do with the answer to note 3? – COMisHARD Dec 21 '16 at 14:53
  • ["It doesn't work" is not a problem statement](http://stackoverflow.com/help/mcve) – Alex Blex Dec 21 '16 at 14:56
  • I'll edit with a more specific code sample. But I was really thinking that this is a generic enough question that someone could provide a generic answer to the bolded question (not the question you asked, which indeed prompted me to give a non-SO-appropriate answer) – COMisHARD Dec 21 '16 at 14:57
  • There are no generic problems with inserting documents into mongodb. You need to describe the exact problem you are facing. http://stackoverflow.com/help/how-to-ask – Alex Blex Dec 21 '16 at 15:09
  • I have now edited to show a more specific problem. – COMisHARD Dec 21 '16 at 15:11
  • If you use mongoose for Node, just create a User schema and insert all data at once using the method create. Ex: `User.create(entry).then(...)` – IARKI Dec 21 '16 at 15:31
  • Nothing's wrong with the code. It is not quite efficient, but it works. Check mongodb logs, turn on `db.setProfilingLevel(2)` to record all queries. – Alex Blex Dec 21 '16 at 15:40
  • Would you mind elaborating on comment 3? – COMisHARD Dec 21 '16 at 17:51
  • As I was saying, I could not reproduce your problem. Did not try with 3.5m files, but with 2 small ones. With empty `myBuilder` it inserts both documents. Since you have a valid `entry` logged in `insertToMongo`, the problem lies somewhere else. If you enable the profiler with level 2, you can check that all operations reached the database with `db.system.profile.find()`. I would also check the mongodb logs, if there are any errors there. – Alex Blex Dec 22 '16 at 09:13

1 Answer


You may want to reuse the database connection for all the inserts. Establishing a connection costs some milliseconds each time, which is worth saving, especially with that large a number of files to import.
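
A minimal sketch of that single-connection approach, assuming Node's fs.promises API and the promise-based MongoClient of recent mongodb drivers (myBuilder is the function from the question; the URL, database, collection, and directory names are taken from the sample code):

const fsp = require('fs').promises;
const path = require('path');
const { MongoClient } = require('mongodb');

async function importAll(dirPath) {
    const client = new MongoClient('mongodb://localhost:27017');
    await client.connect();                                   // connect once, up front
    const collection = client.db('my_database_name').collection('users');
    try {
        // fs.opendir streams directory entries, so 3.5 million
        // file names are never all held in memory at once.
        const dir = await fsp.opendir(dirPath);
        for await (const dirent of dir) {
            if (!dirent.isFile()) continue;
            const text = await fsp.readFile(path.join(dirPath, dirent.name), 'utf8');
            await collection.insertOne(myBuilder(JSON.parse(text)));
        }
    } finally {
        await client.close();                                 // disconnect once, at the end
    }
}

importAll('/nodejs/nodetest1/imports').catch(console.error);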

Regarding inserting one document or many at once, you can use bulk operations, i.e. in a loop: read, say, 10 files, batch the resulting entries, and execute a single insert, using the same db connection throughout.
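
A sketch of that batching idea, reusing the imports and the open collection from the snippet above; the batch size of 1000 is an arbitrary choice, and only that many entries are ever held in memory at once (which also addresses note 1 of the question):

async function insertInBatches(collection, dirPath, batchSize = 1000) {
    let batch = [];
    const dir = await fsp.opendir(dirPath);
    for await (const dirent of dir) {
        if (!dirent.isFile()) continue;
        const text = await fsp.readFile(path.join(dirPath, dirent.name), 'utf8');
        batch.push(myBuilder(JSON.parse(text)));
        if (batch.length >= batchSize) {
            await collection.insertMany(batch, { ordered: false });
            batch = [];                        // keeps memory use bounded
        }
    }
    if (batch.length > 0) {
        await collection.insertMany(batch, { ordered: false });   // flush the final partial batch
    }
}

Passing { ordered: false } lets the server continue past an individual failing document rather than aborting the whole batch, which is usually preferable for a one-off import.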


Alternatively, you could consider mongoimport:

You can use mongoimport from your terminal to import all .json files within a directory:

@echo off
for %%f in (*.json) do (
    "mongoimport.exe" --jsonArray --db databasename --collection collectionname --file "%%f"
)
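
Note that --jsonArray expects each file to contain a JSON array of documents; if each of your files holds a single JSON object, drop that flag. Also, this loop starts one mongoimport process per file, so with 3.5 million files the per-process startup cost may add up; run it from the directory containing the files.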
joseconstela