2

I have two collections - shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and a person can be on the beach, or shopping, or doing both, or doing neither on any day. I now want to query: all shoppers in the last 7 days who did not go to the beach.

I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL DBs. I saw similar questions around joins, and in most cases it was suggested to denormalize. So one solution I could think of is to create a collection - activity - indexed on date, with the user's actions embedded. Something like:

{
    user_id,
    date,
    actions: [ action_type, ... ]
}

Insertion now becomes costly, as I will have to query before each insert.

Shekhar
  • why do you say you have to query before insert? query for what? – Asya Kamsky Jun 03 '12 at 19:08
  • 1
    also do you know about $nin operator? ("not in") – Asya Kamsky Jun 03 '12 at 19:09
  • @AsyaKamsky I think the OP means, to insert an action, it would first be necessary to query the right user. – McGarnagle Jun 03 '12 at 19:10
  • If I see a new action, say 'on-beach', I will need to find the record for the user_id and date to embed the action into the existing data. – Shekhar Jun 03 '12 at 19:11
  • if that action is to be inserted into the right user document that would be a good idea :) but seems like that's unavoidable – Asya Kamsky Jun 03 '12 at 19:11
  • @AsyaKamsky yes I have seen the $nin operator, but the input to it is an array, and all the examples for the operator have few elements in the array. So I am not sure if I should be using it for large arrays. – Shekhar Jun 03 '12 at 19:15
  • why would your array be large? your example stated two activities. – Asya Kamsky Jun 03 '12 at 19:20
  • I was referring to using $nin for people who did not go to the beach, which would be a large array to negate. Still thinking with my RDBMS learnings :). – Shekhar Jun 03 '12 at 20:13
  • @Shekhar I would suggest adding some additional tags to this question - it was really about schema design not just specific query operation. Maybe schema-design tag? – Asya Kamsky Jun 03 '12 at 20:28

3 Answers

3

A few suggestions:

Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?

Consider how many writes vs. reads you will have and which has to be faster.

Determine how your documents will grow over time to make sure your schema is scalable in the long term.

Here is one possible approach, if you will only ever have these two activities: one record per user per day.

{ user: "user1",
  date: "2012-12-01",
  shopped: 0,
  beached: 1
}

Now your query becomes even simpler, whether you have two or ten activities.

When a new activity comes in, you always have to update the correct record based on it. If you were thinking you could just append a record to your collection indicating user, date, and activity, then your inserts are much faster, but your queries now have to do a lot of work combining users, dates, and activities.

With the proposed schema, here is the insert/update statement:

db.coll.update({"user":"username", "date":"somedate"}, {"$inc":{"shopped":1}}, true)

What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't already exist" - in other words, an "upsert" (that's the last true argument).
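To see what those upserts accumulate into, here is a plain-JavaScript sketch of the same bookkeeping, with a Map standing in for the collection (illustrative only - in practice db.coll.update does all of this server-side):

```javascript
// Sketch of the upsert-and-increment logic: one counter document
// per user per day, created on first sight, incremented after that.
const coll = new Map(); // "user|date" -> document

function recordActivity(user, date, activity) {
  const key = user + "|" + date;
  if (!coll.has(key)) {
    // the "upsert" part: create the document if it doesn't exist yet
    coll.set(key, { user: user, date: date, shopped: 0, beached: 0 });
  }
  coll.get(key)[activity] += 1; // the $inc part
}

recordActivity("user1", "2012-12-01", "beached");
recordActivity("user1", "2012-12-01", "beached");
recordActivity("user1", "2012-12-01", "shopped");

console.log(coll.get("user1|2012-12-01"));
// { user: 'user1', date: '2012-12-01', shopped: 1, beached: 2 }
```

Each user/date pair ends up with exactly one document, no matter how many activity events arrive.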

Here is the query for all users on a particular day who went to the beach more than once but didn't shop at all:

db.coll.find({"date":"somedate", "shopped":0, "beached":{$gt:1}})

Be wary of picking a schema where a single document can have continuous and unbounded growth.

For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for an explanation - and keep in mind that large documents will keep getting pulled into your working set, and if they are huge and contain a lot of useless (old) data, that will hurt the performance of your application, as will fragmentation of the data on disk.

Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes for each user, where you track how many friends they have or other semi-stable information about them, and also a user_activity collection where you add records per user per day for what activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are was the first suggestion I made.
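A hypothetical sketch of what those two collections might look like (field names are illustrative, not from the question):

```javascript
// users: one small, mostly-stable document per user - it does not
// grow over time, so it stays cheap to load.
const userDoc = {
  _id: "user1",
  friend_count: 42,      // semi-stable attribute (hypothetical)
  joined: "2012-01-15"
};

// user_activity: one document per user per day. New days mean NEW
// documents, so no single document grows without bound.
const activityDoc = {
  user_id: "user1",      // references users._id
  date: "2012-06-03",
  shopped: 1,
  beached: 0
};

// The link between the two collections is the user id.
console.log(activityDoc.user_id === userDoc._id); // true
```

The key property is that writes append or update small fixed-shape documents instead of growing one ever-larger document per user.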

Asya Kamsky
  • there are more than 2 actions; plus the user itself is changing over time, so when I was creating a new record for every action, I was also dumping the current user attributes with it, to be able to query whether there is any correlation between what action the user does and the user's attributes, like "how many friends he has". – Shekhar Jun 03 '12 at 19:53
  • Users can have more than two activity counters per date. You do *not* want a new record for every action; you would just change (increment) that user's record for that particular day. I'll add an example update. – Asya Kamsky Jun 03 '12 at 19:59
  • hmm, I can probably keep both counter for easy query and then record for each action to do more details per action like shopped_at. – Shekhar Jun 03 '12 at 20:21
  • @Shekhar yes! now you're thinking the noSQL way! pre-compute/pre-aggregate as necessary to keep your queries fast but have flexibility to keep raw data for additional future queries :) – Asya Kamsky Jun 03 '12 at 20:26
0

Insertion now becomes costly, as now I will have to query before insert.

Keep in mind that even with an RDBMS, insertion can be (relatively) costly when there are indices in place on the table (i.e., usually). I don't think using embedded documents in Mongo is much different in this respect.

For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:

db.people.find({ 
    actions: { $nin: ["beach"] }
});

Using embedded documents probably isn't the best approach in this case, though. I think the best would be to have a "flat" activities collection with documents like this:

{
    user_id
    date
    action
}

Then you could run a query like this:

var start = new Date(2012, 4, 27); // May 27 (JS months are 0-indexed)
var end = new Date(2012, 5, 3);    // June 3
db.activities.find({ 
    date: {$gte: start, $lt: end }, 
    action: { $in: ["beach", "shopping" ] } 
});

The last step would be done in your client driver: find the user ids for which records exist for "shopping", but not for "beach" activities.
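That client-side step is just a set difference over the returned records. A minimal sketch in plain JavaScript, assuming the query results have been read into an array of objects with the schema's user_id and action fields:

```javascript
// Given activity records from the date-range query, return the
// user_ids that shopped but never went to the beach in that range.
function shoppersNotBeachers(records) {
  const shoppers = new Set();
  const beachers = new Set();
  for (const r of records) {
    if (r.action === "shopping") shoppers.add(r.user_id);
    if (r.action === "beach") beachers.add(r.user_id);
  }
  // keep only shoppers who never appear among the beach-goers
  return [...shoppers].filter((id) => !beachers.has(id));
}

// sample result set (hypothetical data)
const records = [
  { user_id: 1, action: "shopping" },
  { user_id: 1, action: "beach" },
  { user_id: 2, action: "shopping" },
  { user_id: 3, action: "beach" }
];

console.log(shoppersNotBeachers(records)); // [ 2 ]
```

Note this pulls every matching record over the wire, which is part of the cost of the flat schema.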

McGarnagle
  • Usually if I am doing inserts like this on an RDBMS, I will do a bulk upload, so yes the cost exists, but I can optimize by grouping a bunch of updates. Also, what happens if the query becomes a little more complex - find users who went to the beach only once in the last 3 days? – Shekhar Jun 03 '12 at 19:25
  • @Shekhar that does get more complex. I think that one might involve a map-reduce query, depending on how you end up designing the schema. – McGarnagle Jun 03 '12 at 19:43
  • I don't think you need map/reduce if you use a schema where you just increment a counter for a user/activity on a particular day. – Asya Kamsky Jun 03 '12 at 19:57
0

One possible structure is to use an embedded array of documents (a users collection):

{
    user_id: 1234,
    actions: [ 
        { action_type: "beach", date: "6/1/2012" },
        { action_type: "shopping", date: "6/2/2012" }
    ]
},
{ another user }

Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):

var start = new Date(2012, 5, 1); // June 1 (JS months are 0-indexed)
db.people.find( { 
    actions : { 
        $elemMatch : { 
            action_type : { $in: ["shopping"] }, 
            date : { $gt : start } 
        } 
    } 
});

Expanding on this, you can use the $and operator to find all people who went shopping but did not go to the beach in the past three days:

var start = new Date(2012, 5, 1); // June 1 (JS months are 0-indexed)
db.people.find({
    $and: [
        { actions: {
            $elemMatch: {
                action_type: { $in: ["shopping"] },
                date: { $gt: start }
            }
        } },
        { actions: {
            $not: {
                $elemMatch: {
                    action_type: { $in: ["beach"] },
                    date: { $gt: start }
                }
            }
        } }
    ]
});
McGarnagle
  • I would not recommend this due to performance issues with constantly growing documents. It's better to add documents and query the latest ones than keep continually growing a set of documents. – Asya Kamsky Jun 03 '12 at 19:56
  • looks like this might work, as I can also dump user attributes for each day-action pair to do my other query around "what kind of users are more likely to do this action". Does the number of embedded documents matter for query performance? I can potentially have a million user_ids, and each has say 30 actions every day. @AsyaKamsky yes, exactly my question - I worry that this might be a performance issue. – Shekhar Jun 03 '12 at 19:57
  • @AsyaKamsky What about indices on embedded fields? http://www.mongodb.org/display/DOCS/Indexes#Indexes-IndexingonEmbeddedFields%28%22DotNotation%22%29 – McGarnagle Jun 03 '12 at 20:00
  • @Shekhar I think you can set up indexing for this kind of schema and query. See here: http://www.mongodb.org/display/DOCS/Multikeys – McGarnagle Jun 03 '12 at 20:04
  • the problem is the array will have very old (and probably stale and unnecessary) data in it as time goes by. You will keep adding new dates, but old data will keep getting pulled into memory every time you fetch a particular document from disk, and the document keeps getting bigger. Imagine this collection in six months - each user document has grown from one or two array elements on the first day to several hundred. Every time a document grows, mongo has to relocate it, so your data will become highly fragmented on disk as well. This is a good discussion though, so I'll add some of my comments to my answer. – Asya Kamsky Jun 03 '12 at 20:10
  • @AsyaKamsky yes, good point. I'd be very curious to see how you could do Shekhar's complex queries another way, though. – McGarnagle Jun 03 '12 at 20:26
  • there will always be very complex queries that may require aggregation framework, map/reduce or application logic, but I think the documents' unbounded growth will add performance issues to all of them. – Asya Kamsky Jun 03 '12 at 20:31