6

I have two collections which are in what we would call a "one to one relationship" if this were a relational database. I do not know why one is not nested within the other, but the fact is that for every document in collection "A", there is meant to be a document in collection "B" and vice versa.

Of course, in the absence of foreign key constraints, and in the presence of bugs, sometimes there is a document in "A" which does not have a related document in "B" (or vice versa).

I am new to MongoDB and I'm having trouble creating a query or script which will find me all the documents in "A" which do not have a related document in "B" (and vice versa). I guess I could use some sort of loop, but I don't yet know how that would work - I've only just started using simple queries on the RoboMongo command line.

Can anyone get me started with scripts for this? I have seen "Verifying reference (foreign key) integrity in MongoDB", but that doesn't help me. A bug has caused the "referential integrity" to break down, and I need the scripts in order to help me track down the bug. I also cannot redesign the database to use embedding (though I expect I'll ask why one document is not nested within the other).

I have also seen "How to find items in a collections which are not in another collection with MongoDB", but it has no answers.

Pseudo-Code for a Bad Technique

var misMatches = [];
var collectionB = db.getCollection('B');
var allOfA = db.getCollection('A').find();
while (allOfA.hasNext()) {
    var nextA = allOfA.next();
    if (!collectionB.find(nextA._id)) {
        misMatches.push(nextA._id);
    }
}
Community
  • 1
  • 1
John Saunders
  • 160,644
  • 26
  • 247
  • 397
  • 1
    The 'bad technique' sounds about right to me. What is wrong with it? The query in the loop should probably have an index, otherwise it might make sense to simply grab all data from the db (if possible), or use the one with larger cardinality as the 'root loop'. – mnemosyn May 23 '15 at 03:13
  • First, coming from a SQL Server background, I'm truly appalled at the idea that this may be the best I can do. I would have liked most of this work to be done in the database. I recognize that if the database design had nested B within A, this could never have happened. Still, assuming I have indexes on both `_id` fields, I'd have liked to get MongoDB to compare the two indexes for mismatches, rather than making me do all the work. – John Saunders May 23 '15 at 03:18
  • Hm, I see. I'm afraid that's not possible. There's no ACID 'C', i.e. consistency/invariant checks in MongoDB. – mnemosyn May 23 '15 at 03:21
  • I know there's no 'C'. I'm trying to check for inconsistencies after the fact. In fact, it's probably my program which is causing the inconsistency, and I need to see the pattern. I know of _one_ case of the inconsistency; I have one but to look for if there's only one bad document, a different bug to look for if _all_ documents are bad, and the worst bug to look for if it's more than one document but less than all documents. In _that_ case, I need to see the bad documents to see what they have in common. – John Saunders May 23 '15 at 03:23
  • I've been reading the MongoDB documentation and still don't see a way to do `!collectionB.find(nextA._id)`. – John Saunders May 23 '15 at 03:39
  • 2
    `collection.findOne() != null` tells you that a result was found or inversely has no match. Methods like `.find()` return a cursor, therefore the check is not logical. –  May 23 '15 at 05:15
  • @user3561036: ok, thanks, I was checking against `== undefined`. I don't suppose I can use `map` with a function that returns data if `!= null`, but doesn't return anything otherwise? The size of the input is always the size of the output? – John Saunders May 23 '15 at 05:29
  • Well you could use `collection.find().toArray().map(function(el) {.. } )` or language equivalent. But I'm not sure I see the point if all you are testing for is the "existence" of "related" data. I think the overall point as made by @mnemosyn originally is the "horrible looping code" is the only way to basically approach this since "joins" are not supported. Clearly an inherited problem, but it points to a possible case to "re-design" the structure of the data in storage. Or indeed reconsider storage engine choice. –  May 23 '15 at 08:04

1 Answers1

3

I don't know if this scales well, but ...

... given this sample date set:

> db.a.insert([{a:1},{a:2},{a:10}       ])
> db.b.insert([      {b:2},{b:10},{b:20}])
//             ^^^^^              ^^^^^^
//                inconsistent 1-to-1 relationship

You could use map-reduce to collect the set of key in a and merge it with the set of key from b:

mapA=function() {
  emit(this.a, {col: ["a"]})
}

mapB=function() {
  emit(this.b, {col: ["b"]})
}

reduce=function(key, values) {
  // merge both `col` arrays; sort the result
  return {col: values.reduce(
                 function(a,b) { return a.col.concat(b.col) }
                            ).sort()}
}

Producing:

> db.a.mapReduce(mapA, reduce, {out:{replace:"result"}})
> db.b.mapReduce(mapB, reduce, {out:{reduce:"result"}})
> db.result.find()
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 2, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 10, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] } }

Then it is quite easy to find all id that wasn't found in collection a and b. In addition you should be able to spot duplicate keys in one or the other collection:

> db.result.find({"value.col": { $ne: [ "a", "b" ]}})
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] }
Sylvain Leroux
  • 50,096
  • 7
  • 103
  • 125