I'm working on a complicated map-reduce process for a MongoDB database. I've split some of the more complex code off into modules, which I then make available to my map/reduce/finalize functions by including them in my scopeObj, like so:

  const scopeObj = {
    userCalculations: require('../lib/userCalculations')
  }

  function myMapFn() {
    let userScore = userCalculations.overallScoreForUser(this)
    emit({
      'Key': this.userGroup
    }, {
      'UserCount': 1,
      'Score': userScore
    })
  }

  function myReduceFn(key, objArr) { /*...*/ }

  db.collection('userdocs').mapReduce(
    myMapFn,
    myReduceFn,
    {
      scope: scopeObj,
      query: {},
      out: {
        merge: 'userstats'
      }
    },
    function (err, stats) {
      return cb(err, stats);
    }
  )

...This all works fine. I had until recently thought it wasn't possible to include module code into a map-reduce scopeObj, but it turns out that was just because the modules I was trying to include all had dependencies on other modules. Completely standalone modules appear to work just fine.

Which brings me (finally) to my question. How can I -- or, for that matter, should I -- incorporate more complex modules, including things I've pulled from npm, into my map-reduce code? One thought I had was using Browserify or something similar to pull all my dependencies into a single file, then include it somehow... but I'm not sure what the right way to do that would be. And I'm also not sure of the extent to which I'm risking severely bloating my map-reduce code, which (for obvious reasons) has got to be efficient.

Does anyone have experience doing something like this? How did it work out, if at all? Am I going down a bad path here?

UPDATE: A clarification of the issue I'm trying to overcome: In the code above, require('../lib/userCalculations') is executed by Node -- it reads in the file ../lib/userCalculations.js and assigns that file's module.exports object to scopeObj.userCalculations. But say there's a call to require(...) somewhere within userCalculations.js, nested inside one of its functions. That call hasn't actually been executed yet. So when I call userCalculations.overallScoreForUser() within the map function, MongoDB attempts to execute the require function itself -- and require isn't defined in MongoDB's JavaScript environment.
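
To make the failure concrete, here's a hypothetical sketch of the shape of userCalculations.js (the testModule name is made up for illustration; the real file is more involved):

  // ../lib/userCalculations.js -- hypothetical sketch, not the real file
  module.exports = {
    overallScoreForUser(userDoc) {
      // Node never runs this require at load time; it only executes when
      // overallScoreForUser() is called -- which happens inside MongoDB,
      // where require doesn't exist.
      const testModule = require('testModule')
      return testModule.someExportedMethod(userDoc)
    }
  }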

Browserify, for example, deals with this by compiling all the code from all the required modules into a single JavaScript file with no require calls, so it can run in the browser. But that doesn't exactly work here, because I need the resulting code to itself be a module that I can use the way I use userCalculations in the code sample. Maybe there's a weird way to run browserify that I'm not aware of? Or some other tool that just "flattens" a whole hierarchy of modules into a single module?
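
One shape this could take, if browserify's --standalone flag works the way I'd hope (it emits a single self-contained UMD file with all require calls inlined), is something like this -- untested sketch only:

  // Built beforehand with:
  //   browserify ../lib/userCalculations.js --standalone userCalculations -o bundle.js
  // bundle.js would then contain no require calls at all, so in theory:
  const scopeObj = {
    userCalculations: require('./bundle.js')
  }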

Hopefully that clarifies a bit.

DanM
  • I don't know the answer about accessing the modules here, but would you be willing to consider an alternative, which is to rewrite the map-reduce code using the aggregation framework? If yes, see if you can post the relevant code from map and reduce from the other modules. More [here](https://stackoverflow.com/questions/13908438/is-mongodb-aggregation-framework-faster-than-map-reduce) – s7vr Mar 14 '18 at 16:39
  • @Veeram Unless I'm missing something, I don't think the Aggregation Framework will work for me -- I need to be able to do some pretty complex calculations in the `reduce` stage, and I also need to be able to do incremental updates (i.e., "merge" style output). – DanM Mar 14 '18 at 17:24
  • How much 'control' do you have over the dependency hierarchy? Meaning, are there any hidden dependencies or 3rd-party code? I am not sure if [this is the cause](https://nodejs.org/api/modules.html#modules_cycles) of your problem, but if it is, it could be tackled. – Jankapunkt Mar 15 '18 at 11:37
  • @Jankapunkt Please see my update above. – DanM Mar 15 '18 at 13:48
  • Okay, I see. What if you wrap the require in a self-executing function, like so: `userCalculations: (function(){ return require('../lib/userCalculations') })()` ? It should resolve the required modules first. The only problem would then be to make sure that this function itself is not executed before its expected turn at runtime. – Jankapunkt Mar 15 '18 at 13:52
  • @Jankapunkt Alas, that still seems to only evaluate that particular require statement. If the first line of `userCalculations.js` is `const testModule = require('testModule')`, and `userCalculations.overallScoreForUser()` includes the statement `testModule.someExportedMethod()`, MongoDB returns the error `"testModule is not defined"`. – DanM Mar 15 '18 at 14:51
  • Can you please add some code example from the `userCalculations` file to your question? – Jankapunkt Mar 15 '18 at 15:13
  • userCalculations is just a standard node module. I can't share the code, but I'm only interested in a solution that would work with any such module (well, I don't expect to do filesystem operations or anything. For the sake of argument let's pretend I'm trying to include Lodash.) – DanM Mar 19 '18 at 15:21

1 Answer

As a generic response, the answer to your question ("How can I -- or, for that matter, should I -- incorporate more complex modules, including things I've pulled from npm, into my map-reduce code?") is no: you cannot safely include complex modules in Node code you plan to send to MongoDB for mapReduce jobs.

You mentioned the problem yourself: nested require statements. require itself is synchronous, but a require call nested inside a function is not executed until that function is called -- and when that happens inside MongoDB's VM, where require does not exist, it throws.

Consider the following example of three files: data.json, dep.js and main.js.

// data.json - just something we require "lazily"
true

// dep.js -- equivalent of your userCalculations
module.exports = {
  isValueTrue() {
    // The problem: a nested require, executed only at call time
    return require('./data.json');
  }
}


// main.js - from here you send your mapReduce to MongoDB.
// require the dependency instantly
const calc = require('./dep.js');
// require is synchronous; the effect is the same if you do:
//   const calc = (function () { return require('./dep.js') })();

console.log('Calc is loaded.');
// Let's mess with unwary devs
require('fs').writeFileSync('./data.json', 'false');

// Is calc.isValueTrue() true or false here?
// It's false: the nested require runs only now, reading the overwritten
// file -- even though data.json said true when main.js started.
console.log(calc.isValueTrue());

As a general solution, this is not feasible. While the vast majority of modules will likely not have nested require statements, HTTP calls, internal service calls, global variables, and the like, there are those that do. You cannot guarantee that this would work.
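
To illustrate the "globals and similar" category with a made-up module: this loads and runs fine under Node, yet has nothing to stand on once its source is shipped to MongoDB's VM:

// hypothetical module: looks self-contained, but leans on the Node host
module.exports = {
  scoreLabel(score) {
    // process (like Buffer, setTimeout, console and friends) exists in
    // Node, but not in MongoDB's JavaScript engine -- this throws there.
    const threshold = parseInt(process.env.SCORE_THRESHOLD || '50', 10);
    return score > threshold ? 'high' : 'low';
  }
};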

Now, as your local implementation -- e.g. if you require exact, specific versions of npm modules that you have tested well with this technique and know will work, or that you published yourself -- it is somewhat feasible.

However, even in this case, if this is a team effort, there's bound to be a developer down the line who will not know where or how your dependency is used, who will use globals (not on purpose, but by omission, e.g. they get `this` wrong), or who simply won't know the implications of what they are doing. If you have a strong integration testing suite, you can guard against this, but the thing is, it's unpredictable. Personally, I think that when you can choose between unpredictable and predictable, you should almost always choose predictable.

Now, if you have an explicitly stated purpose for a certain library to be used in MongoDB mapReduce, this would work. You would have to guard well against omissions and problems, and have strong testing infrastructure, but I would make certain the purpose is explicit before feeling safe enough to do this. And of course, if you're doing something so complex that you need several npm packages to do it, maybe you can have those functions directly on the MongoDB server, maybe you can do your mapReducing in something better suited to the purpose, or similar.
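
For reference, the "functions directly on MongoDB server" option is MongoDB's stored JavaScript: functions saved to the system.js collection can be called by name from server-side JavaScript contexts, mapReduce included. A minimal mongo-shell sketch, with a placeholder body:

// mongo shell: store a named server-side function once...
db.system.js.save({
  _id: 'overallScoreForUser',
  value: function (userDoc) { /* your calculation, with no require anywhere */ }
});
// ...after which map/reduce functions running on the server can call
// overallScoreForUser(this) directly, with no module shipping at all.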

To conclude: for a purpose-built library with an explicit mission statement that it is to be used with Node and MongoDB mapReduce, I would make sure my tests cover all mission-critical and important functionality, and then import such an npm package. Otherwise I would neither use nor recommend this approach.

Zlatko
  • Thanks for all the information here (I was away for the weekend, hence the late response). You haven't addressed the concept of using something like browserify or webpack -- i.e. tools which compile a bunch of JS files including required dependencies into a single file. Do you have any thoughts on that, as relates to your answer? Thanks! – DanM Mar 19 '18 at 15:17
  • The problem comes down to the same thing. Webpack doesn't and should not actually _call_ your methods. It basically says "if the code requires 'some-module', I will load this file, or hand over this already-loaded stuff". So it cannot go deep through this. What you could in theory do is write, e.g., something like a babel/webpack plugin that _walks the AST_, and if it finds a `require` or `import` node, calls it right away and inlines its contents instead. Maybe such a thing already exists. Writing such a plugin would be a super interesting task. Being accountable for it when it goes into production? Nope. – Zlatko Mar 19 '18 at 15:40
  • That very last point -- being accountable for it in production code -- is a very good one. Going to mark this as correct; you've convinced me to rethink this idea. – DanM Mar 19 '18 at 19:28