I want to ask about findAndModify in MongoDB. As far as I know, the query is "isolated by document".

This means that if I run 2 findAndModify operations like this:

{a:1},{$set:{status:"processing", engine:1}}
{a:1},{$set:{status:"processing", engine:2}}

and this query can potentially affect 2,000 documents, then because there are two queries (two engines), some documents might end up with "engine:1" and others with "engine:2".

I don't think findAndModify will isolate the first query. To isolate the first query, I need to use $isolated.

Is everything I have written correct?

UPDATE - scenario

The idea is to write a proximity engine. The User collection has 1,000, 2,000, or 3,000 users, or even millions.

  1. Order the users by nearest to the point "lng,lat".
  2. In Node.js, I do some computation that I CAN'T do in MongoDB.
  3. I group the users into "UserGroup" and write a Bulk update.

With 2,000-3,000 users, this process (steps 1 to 3) takes time, so I want to run multiple threads in parallel.

Parallel threads mean parallel queries. This can be a problem, since Query3 can take some of the users of Query1. If this happens, then at step 2 I don't have the nearest users overall but only the nearest "for this query", because another query may have taken the rest of the users. As a result, some users in New York might be grouped with users in Los Angeles.

UPDATE 2 - scenario

I have a collection like this:

{location:[lng,lat], name:"1",gender:"m", status:'undone'}
{location:[lng,lat], name:"2",gender:"m", status:'undone'}
{location:[lng,lat], name:"3",gender:"f", status:'undone'}
{location:[lng,lat], name:"4",gender:"f", status:'done'}

What I need to do is create groups of users by grouping the nearest ones. Each group has 1 male + 1 female. In the example above, I expect only 1 group (user 1 + user 3), since they are male + female and near each other (user 2 is also male but is far away from user 3, and user 4 is also female but has status 'done', so she is already processed).

Once the group is created (only 1 group), its 2 users are marked as 'done', and user 2 stays 'undone' for a future operation.

I want to be able to process 1,000, 2,000, or 3,000 users very fast.

UPDATE 3 - from community

Okay now. Can I please try to summarise your case. Given your data, you want to "pair" male and female entries together based on their proximity to each other. Presumably you don't want to do every possible match but just set up a list of general "recommendations", and let's say 10 for each user by the nearest location. Now I'd have to be stupid to not see the full direction of where this is going, but does this sum up the basic initial problem statement. Process each user, find their "pairs", mark them as "done" once paired and exclude them from other pairings by combination where complete?

Community
Daniele Tassone
  • Well there you go. `findAndModify()` only ever updates one document at a time. `$isolated` does not make "two" modifications happen together, but only affects each update individually. You are looking for "transactions" and there are no "transactions" in MongoDB. If anything, you probably want ["Bulk"](http://docs.mongodb.org/manual/reference/method/Bulk/) operations. But no form of "Bulk" or "multi" update ever returns a document, as it just would not make any sense. – Blakes Seven Aug 29 '15 at 10:01
  • In fact it really sounds like you want "Ordered" Bulk updates. But your actual structure ( presuming a loop ) is not really clear here. – Blakes Seven Aug 29 '15 at 10:03
  • Why are you running these two operations in close proximity? You are basically running one query to set engine 1 and then another just to set engine 2 in quick succession, not sure why – Sammaye Aug 29 '15 at 10:09
  • These operations come from different threads, so I can't use Bulk. – Daniele Tassone Aug 29 '15 at 10:45
  • Sammaye, I have "n" Node.js processes called "engines". Each engine takes responsibility for a query. Two engines might run the same query at the same time; I can't tell in advance, but it can happen. My update will also use GeoNear, so I need to prevent two engines from splitting the GeoNear documents between them. – Daniele Tassone Aug 29 '15 at 10:50
  • Imagine that I have a city with 200 users. Engine1 will take all 200 users if it is isolated. Otherwise Engine1 will take 100 users and maybe Engine2 will take the other 100. If this happens, I have divided the city into 2 groups, and this is bad. – Daniele Tassone Aug 29 '15 at 10:52
  • @Dada if it comes from a different thread how do you know which query is the one that's supposed to be run even with isolation and transactions? – Sammaye Aug 29 '15 at 10:56
  • Blakes, I need to run 10 engine queries in parallel (in 10 threads). It is unlikely that the queries happen at the same time, but I want to prevent this scenario. Another solution would be to manage another collection called "EngineQueue" where each engine checks whether it can run or not (maybe another query is still working). This way I would get the same result without using $isolated – Daniele Tassone Aug 29 '15 at 10:56
  • I am not sure I am following you, you say: " If 2 engine make the same query at same time I don't know, but can happen" as though it should be rare but then you say " If this happen, I have divided the City in 2 group and this is bad." as though it will happen regardless. I am having trouble following your scenario here – Sammaye Aug 29 '15 at 10:58
  • @Dada. See the `@` there. That is called tagging here. If you want to talk to someone then do what I did. Can you please edit your question as it does not make any sense. You need to state your clear case in your question and not in your comments. Stop thinking about `$isolated` and start saying what you really need to do. Cannot be more clear here that `$isolated` is **not** what you want. – Blakes Seven Aug 29 '15 at 11:11
  • @BlakesSeven, question updated. – Daniele Tassone Aug 29 '15 at 12:09
  • @Dada Sorry but that is not a clear description to me. Please state what you "actually need to do". I don't want to see "Parallel operation" or "isolated" or anything like that. Just a business case that needs to be solved. This is why you are getting it wrong, as you are not objectively looking at what needs to really be achieved. Clear dot points of process and what needs to happen where and when, and who needs to do what. Think "Observers" – Blakes Seven Aug 29 '15 at 12:12
  • @BlakesSeven I don't know how to write it the way you want. But I appreciate your time. – Daniele Tassone Aug 29 '15 at 12:15
  • It's simple. Just write out 1. The intended result to the observer using the data. 2. Update processes that need to happen. 3. What the observer should see while any updates are happening. But be descriptive on "2" without launching into the presumptions you already have. Just explain the updates that need to be applied and any cycle or selection involved. This is how you describe problems to peers. So consider this a learning exercise. – Blakes Seven Aug 29 '15 at 12:18
  • Okay now. Can I please try to summarise your case. Given your data, you want to "pair" male and female entries together based on their proximity to each other. Presumably you don't want to do every possible match but just set up a list of general "recommendations", and let's say 10 for each user by the nearest location. Now I'd have to be stupid to not see the full direction of where this is going, but does this sum up the basic initial problem statement. Process each user, find their "pairs", mark them as "done" once paired and exclude them from other pairings by combination where complete? – Blakes Seven Aug 29 '15 at 12:54
  • Now. IMHO, my little summary there appears to be a lot more descriptive of the problem you are facing than all of your efforts to describe it to date. Can you please then just update your question to be pretty much exactly that. As that is a problem that can be solved and described to you. – Blakes Seven Aug 29 '15 at 12:56
  • @BlakesSeven yes, that is exactly it. I will update the question. – Daniele Tassone Aug 29 '15 at 13:27

1 Answer

This is a non-trivial problem and cannot be solved easily.

First of all, an iterative approach (which admittedly was my first one) may lead to wrong results.

Given we have the following documents

{
  _id: "A",
  gender: "m",
  location: { longitude: 0, latitude: 1 }
}

{
  _id: "B",
  gender: "f",
  location: { longitude: 0, latitude: 3 }
}

{
  _id: "C",
  gender: "m",
  location: { longitude: 0, latitude: 4 }
}

{
  _id: "D",
  gender: "f",
  location: { longitude: 0, latitude: 9 }
}

With an iterative approach, we would start with "A" and calculate the closest female, which, of course, would be "B" with a distance of 2. However, the closest distance between any male and any female is 1 (the distance from "B" to "C"). But even if we found this pair first, that would leave the other match, "A" and "D", at a distance of 8, whereas with our previous solution "A" would have had a distance of only 2 to "B".
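The trade-off described above can be checked with a minimal in-memory sketch. It is not MongoDB code: the `pairInOrder` helper is hypothetical, and positions are reduced to the latitudes of the example documents, since the longitude is 0 for all of them.

```javascript
// Latitudes taken from the example documents (longitude is 0 for all).
const males = [{ id: "A", lat: 1 }, { id: "C", lat: 4 }];
const females = [{ id: "B", lat: 3 }, { id: "D", lat: 9 }];

const distance = (p, q) => Math.abs(p.lat - q.lat);

// Hypothetical helper: pair each male, in the given order, with the
// nearest female that is still unmatched.
function pairInOrder(orderedMales, allFemales) {
  const free = [...allFemales];
  return orderedMales.map((m) => {
    free.sort((x, y) => distance(m, x) - distance(m, y));
    const partner = free.shift();
    return { pair: m.id + "-" + partner.id, distance: distance(m, partner) };
  });
}

const total = (pairs) => pairs.reduce((sum, p) => sum + p.distance, 0);

// Iterating from "A": A-B (2) and C-D (5).
const iterative = pairInOrder(males, females);
// Starting from the globally closest pair B-C (1) leaves A-D (8).
const closestFirst = pairInOrder([males[1], males[0]], females);

console.log(iterative, total(iterative));       // total distance 7
console.log(closestFirst, total(closestFirst)); // total distance 9
```

In this tiny example the iteration order alone changes the total distance from 7 to 9, which is why the choice of strategy matters.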

So we need to decide which way to go:

  1. Naively iterate over the documents
  2. Find the lowest sum of distances between matching individuals (which itself isn't trivial to solve), so that all participants together have the shortest travel.
  3. Matching only participants within an acceptable distance
  4. Do some sort of divide and conquer and match participants within a certain radius of a common landmark (say cities, for example)

Solution 1: Naively iterate over the documents

var users = db.collection.find(yourQueryToFindThe1000users);

// We can safely use an unordered op here,
// which has greater performance.
// Since we use the "done" array to keep track of
// the processed members, there is no drawback.
var pairs = db.pairs.initializeUnorderedBulkOp();

var done = new Array();

users.forEach(
  function(currentUser){

     // Skip users that were already paired as someone's partner.
     // indexOf compares with ===, so this works for string or number
     // _ids; ObjectIds would have to be stored in their string form.
     if( done.indexOf(currentUser._id) !== -1 ) { return; }

     var genderToLookFor = ( currentUser.gender === "m" ) ? "f" : "m";

     // Using the $near operator on the location field,
     // the returned documents automatically are sorted from nearest
     // to farthest, and since findAndModify returns only one document
     // we get the closest matching partner.
     // Assumes a 2dsphere index on "location" and that
     // currentUser.location is a [lng, lat] array.
     var nearPartner = db.collection.findAndModify({
       query: {
         status: "undone",
         gender: genderToLookFor,
         location: {
           $near: {
             $geometry: {
               type: "Point" ,
               coordinates: currentUser.location
             }
           }
         }
       },
       update: { $set: { "status":"done" } },
       fields: { _id: 1}
     });

     // No partner left for this user, so it stays "undone".
     if( nearPartner === null ) { return; }

     // Obviously, the current user already is processed.
     // However, we store it for simplifying the process of
     // setting the processed users to done.
     done.push(currentUser._id, nearPartner._id);

     // We have a pair, so we store it in a bulk operation
     pairs.insert({
       _id:{
         a: currentUser._id,
         b: nearPartner._id
       }
     });

  }
)

// Write the found pairs
pairs.execute();

// Mark all that are unmarked by now as done
db.collection.update(
  {
    _id: { $in: done },
    status: "undone"
  },
  {
    $set: { status: "done" }
  },
  { multi: true }
)

Solution 2: Find the smallest sum of distances between matches

This would be the ideal solution, but it is extremely complex to solve. We need to take all members of one gender, calculate all distances to all members of the other gender, and iterate over all possible sets of matches. In our example it is quite simple, since there are only 4 combinations for any given gender. Thinking about it twice, this might be at least a variant of the traveling salesman problem (MTSP?). If I am right with that, the number of combinations should be

$\frac{n!}{2}$ for all $n > 2$, where $n$ is the number of possible pairs,

and hence

$\frac{10!}{2} = 1{,}814{,}400$ for $n = 10$

and an astonishing

$\frac{25!}{2} \approx 7.755 \times 10^{24}$ for $n = 25$.

That's 7.755 quadrillion (long scale) or 7.755 septillion (short scale). While there are approaches to solving this kind of problem, the world record is somewhere in the range of 25,000 nodes using massive amounts of hardware and quite tricky algorithms. I think for all practical purposes, this "solution" can be ruled out.
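To get a feel for this growth, here is a minimal sketch that assumes the count is n!/2, an assumption that is consistent with the 7.755 × 10^24 figure quoted for n = 25. The `combinations` helper is hypothetical and uses BigInt so that the result does not lose precision.

```javascript
// Hypothetical helper: n! / 2, computed with BigInt to avoid overflow.
// The n!/2 formula is an assumption, not something MongoDB provides.
function combinations(n) {
  let factorial = 1n;
  for (let i = 2n; i <= BigInt(n); i += 1n) {
    factorial *= i;
  }
  return factorial / 2n;
}

console.log(combinations(10).toString()); // 1814400
console.log(combinations(25).toString()); // a 25-digit number, about 7.755 * 10^24
```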

Solution 3: Match participants within range of a common landmark

In order to prevent the problem that people might be matched with unacceptable distances between them and depending on your use case, you might want to match people depending on their distance to a common landmark (where they are going to meet, for example the next bigger city).

For our example assume we have cities at [0,2] and [0,7]. The distance (5) between the cities hence has to be our acceptable range for matches. So we do a query for each city

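db.collection.find({
  location: {
    $near: {
      $geometry: {
        type: "Point" ,
        // GeoJSON order is [ longitude, latitude ]
        coordinates: [ 0 , 2 ]
      },
      // 5 mirrors the example; for GeoJSON points the unit is meters
      $maxDistance: 5
    }
  },
  status: "undone"
})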

and iterate over the results naively. Since "A" and "B" would be the first in the result set, they would be matched and done. Bad luck for "C" here, as no girl is left for him. But when we do the same query for the second city he gets his second chance. Ok, his travel gets a bit longer, but hey, he got a date with "D"!
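The walk-through above can be replayed with a small in-memory sketch. It is not MongoDB code: `nearCity` is a hypothetical stand-in for the `$near` / `$maxDistance` query, positions are reduced to latitudes, and as a simplification each user is paired with the first opposite-gender user in the city's result set.

```javascript
const users = [
  { _id: "A", gender: "m", lat: 1, status: "undone" },
  { _id: "B", gender: "f", lat: 3, status: "undone" },
  { _id: "C", gender: "m", lat: 4, status: "undone" },
  { _id: "D", gender: "f", lat: 9, status: "undone" },
];
const cities = [2, 7];  // latitudes of the two example cities
const maxDistance = 5;  // distance between the cities

// Hypothetical stand-in for the query: undone users within range
// of the city, sorted from nearest to farthest.
function nearCity(cityLat) {
  return users
    .filter((u) => u.status === "undone" && Math.abs(u.lat - cityLat) <= maxDistance)
    .sort((x, y) => Math.abs(x.lat - cityLat) - Math.abs(y.lat - cityLat));
}

const pairs = [];
for (const city of cities) {
  for (const user of nearCity(city)) {
    if (user.status === "done") continue; // already matched in this pass
    const partner = nearCity(city).find((c) => c.gender !== user.gender);
    if (!partner) continue;               // nobody left near this city
    user.status = "done";
    partner.status = "done";
    pairs.push([user._id, partner._id]);
  }
}

console.log(pairs); // A is paired with B near the first city, D with C near the second
```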

To find the respective distances, take a fixed set of cities (towns, metropolitan areas, whatever your scale is), order them by location, and set each city's radius to the larger of the two distances to its immediate neighbors. This way, you get overlapping areas. So even when a match cannot be found in one place, it may be found in another.
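As a rough illustration of that radius rule, here is a one-dimensional sketch that treats cities as latitudes on a line, like the example. The `cityRadii` helper and the third city at latitude 15 are hypothetical; real coordinates would need a proper geographic distance.

```javascript
// Hypothetical helper: for cities given as latitudes on a line, the radius
// of each city is the larger of the distances to its immediate neighbors.
function cityRadii(latitudes) {
  const sorted = [...latitudes].sort((a, b) => a - b);
  return sorted.map((lat, i) => {
    const gaps = [];
    if (i > 0) gaps.push(lat - sorted[i - 1]);
    if (i < sorted.length - 1) gaps.push(sorted[i + 1] - lat);
    return { lat: lat, radius: gaps.length ? Math.max(...gaps) : 0 };
  });
}

// The two example cities plus a made-up third one at latitude 15.
console.log(cityRadii([2, 7, 15]));
// radii: 5 for the city at 2, 8 for the city at 7, 8 for the city at 15
```

Because the city at 7 gets the larger radius of 8, its search area overlaps the areas of both neighbors, which gives unmatched users a second chance.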

Iirc, Google Maps allows you to retrieve the cities of a nation based on their size. An easier way would be to let people choose their respective city.

Notes

  1. The code shown is not production ready and needs to be refined.
  2. Instead of using "m" and "f" for denoting a gender, I suggest using 1 and 0: Can still be easily mapped, but needs less space to save.
  3. Same goes for status.
  4. I think the last solution is the best, as it optimizes distances to some degree and keeps the chances of a match high.
Markus W Mahlberg
  • Solution 3 is what I was following. It is not thread-safe: if I run multiple threads in order to speed up the code, then I will get unpredictable behavior, since multiple 'find' calls will create multiple groups of the same users! – Daniele Tassone Aug 30 '15 at 10:06
  • Instead of using threads, I'd rather [use a job queue](https://medium.com/node-js-tips-tricks/implementing-a-job-queue-with-node-js-ffcfbc824b01) for simplicity and MongoDB's [findAndModify](http://docs.mongodb.org/manual/reference/method/db.collection.findAndModify/) when looking for a matching partner, setting the status to "done" and returning the new document in one atomic operation. Problem solved. – Markus W Mahlberg Aug 30 '15 at 11:22
  • So this way I can also use a job queue with multiple virtual machines. I just need to "set" the job queue with some "machineId", and when it finishes the next job is fired (maybe for the same machine or for another machine). What do you think? – Daniele Tassone Aug 30 '15 at 12:54
  • @Dada IMHO, KISS should be your primary concern. – Markus W Mahlberg Aug 30 '15 at 14:33
  • What do you mean by KISS? – Daniele Tassone Aug 30 '15 at 15:30
  • I was thinking of a "KISS.js" framework, which is why I asked you to be more clear. Regarding Keep It Simple, Stupid: I don't see how it helps me, since my question is specifically about whether a job queue can also be a solution for syncing multiple virtual machines. I think so. – Daniele Tassone Aug 31 '15 at 12:36
  • My approach was "keep it simple and straightforward". With kue, [basic "parallel" processing is as easy as adding a parameter to the "process" call](https://github.com/Automattic/kue#processing-concurrency). Less code to maintain and fewer moving parts (=KISS) than [a full blown cluster configuration](https://github.com/Automattic/kue#parallel-processing-with-cluster). – Markus W Mahlberg Aug 31 '15 at 13:07