During the development of our map-reduce jobs, our MR code generates useful diagnostic data structures independently of the data being map-reduced. Is there an easy way to get this data out to the code that called mapReduce, or to persist it in Mongo? Just writing to the log file is turning out to be far from ideal because (a) there is a lot of data there already and (b) our diagnostic info is highly structured and, in fact, we'd like to run queries against it.

My investigation so far suggests that MR data structures are passed by value (via serialization), so any in-memory data structures are lost, including those hooked to the "global" scope. The namespaces are isolated from the main server-side JS namespace, so db.eval() can't seem to reach them (or, at least, I don't know where to look). Last but not least, although all the database objects and functions are present, 10gen is generating (confusing) error messages to prevent their use, e.g., complaining that coll.insert is not a function even though typeof coll.insert === 'function' is true.

To be clear, I'm interested in doing this for development on a single node, because the logging/debugging support in MongoDB is pretty limited. Side effects of this kind are not appropriate in production environments.

Sim
  • Exactly what kind of data is it generating? The output of MR can be persisted in a collection. In C#, I specify it as: MyInputCollection.MapReduce(map, reduce, MapReduceOptions.SetOutput("MyOutputCollection")); You then read from the persistent collection 'MyOutputCollection'. – Aafreen Sheikh Aug 01 '12 at 07:42
  • I want to save information that is independent of the MR output. Think of it as data exhaust, e.g., for detailed structured logging/benchmarking information that I want to process with code so I don't want it to end up in the log files. – Sim Aug 02 '12 at 01:07
  • Have you tried using a capped collection for logging? I don't quite understand why coll.insert should fail. – Aafreen Sheikh Aug 02 '12 at 05:19
  • @AafreenSheikh insert() fails because ad hoc DB operations are disabled during map-reduce. 10gen must have done it to control the environment. – Sim Aug 06 '12 at 04:05

1 Answer

As surmised, it is not possible (as of MongoDB 2.2) to access another DB from within the Map/Reduce functions. Aside from the potential performance impact, there is also the possibility of creating deadlocks and other unwanted side effects.

Unfortunately, that leaves print() to the mongod log as the only "out of band" output option.

Depending on your diagnostic output, one approach to try would be:

  • add a unique marker that allows you to identify the output (or even the individual run) in the log

  • serialize your output using tojson() so it is logged with a parseable structure, ideally emitted on a single line when you print()

  • write a script to tail mongod.log for lines containing your unique marker and insert those into another collection for reporting

Example of code that can run from within an M/R function:

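// diagrun (an identifier for this run) and z (the key being processed)
// are assumed to be defined by the surrounding M/R function or its scope.
var diag = {
    'run'  : diagrun,
    'phase': 'map',
    'key'  : z
};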
print("MAPDIAG:" + tojson(diag));

Example output:

$ tail -f mongod.log | grep "^MAPDIAG"
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "dog" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "cat" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "cat" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "mouse" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "cat" }
MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "dog" }
MAPDIAG:{ "run" : "20120824", "phase" : "reduce", "key" : "cat", "total" : 3 }
MAPDIAG:{ "run" : "20120824", "phase" : "reduce", "key" : "dog", "total" : 2 }
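
The parsing half of the third step could look roughly like the sketch below. It is only an illustration: parseDiagLines is a hypothetical helper name, and it assumes the text after the marker is strict JSON on a single line, as in the sample output above. tojson() can emit shell-only forms such as ObjectId(...), which JSON.parse would reject; this sketch simply skips such lines.

```javascript
// Hypothetical helper: turn marked log lines back into plain objects.
// Assumes everything after the marker is strict, single-line JSON.
const MARKER = "MAPDIAG:";

function parseDiagLines(lines, marker = MARKER) {
  const docs = [];
  for (const line of lines) {
    // The marker may be preceded by a timestamp or other log prefix.
    const at = line.indexOf(marker);
    if (at === -1) continue; // not a diagnostic line
    try {
      docs.push(JSON.parse(line.slice(at + marker.length)));
    } catch (err) {
      // Not valid JSON (e.g. shell-only syntax); skip rather than abort.
    }
  }
  return docs;
}

// Sample lines taken from the example output above, plus one unrelated line.
const sample = [
  'MAPDIAG:{ "run" : "20120824", "phase" : "map", "key" : "dog" }',
  "some unrelated log line",
  'MAPDIAG:{ "run" : "20120824", "phase" : "reduce", "key" : "cat", "total" : 3 }',
];
const docs = parseDiagLines(sample);
console.log(docs.length + " diagnostic documents parsed");
```

The resulting objects can then be inserted into a reporting collection with whichever driver you already use; that part is left out here because it depends on your setup.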
Stennie
  • this is more or less what we ended up doing. I built a logger class that logs to a collection and uses print(). During MR, the log collection inserts generate exceptions which are swallowed. I wish 10gen paid more attention to development/debugging support. – Sim Aug 25 '12 at 00:48
  • @Sim: would be helpful if you could create a [Jira issue](https://jira.mongodb.org/browse/SERVER) in the MongoDB tracker (SERVER queue, component 'MapReduce/Distinct/Group') with some more information on what would be needed/useful for debugging. Perhaps something similar to Hadoop [MapReduce Counters](http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/), with an optional logging output callback at the end of the run. – Stennie Aug 25 '12 at 01:48