0

In my mongodb collection i have 15 million documents with following json structure. The embedded document count inside playfields array field changes for each json document. All my queries involves filtering the documents based on data in playfields array field. All queries are taking more than 2 minutes to execute.

The value field in embedded document stores multiple datatypes (int, string). Is this bad design?

Am i doing anything wrong in writing queries? Am I missing any indexes? Do I have to move data from embed documents in single document to multiple collections?

The multiple conditions query(posted in question) takes 3 minutes to execute. Is the syntax wrong to use when filtering same collection? My goal is return docs that meets all these conditions.

If I break the query into piece each took ms to execute. 1) db.playfieldvalues.find({$or:[ {playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: "NYI NEW YORK ISLANDERS"}}},{playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: "T.B TAMPA BAY LIGHTNING"}}}]}) 2) db.playfieldvalues.find({playfields: {$elemMatch:{ID:"Play.NHL.NHLHomeTeam" ,value: "BOS BOSTON BRUINS"}}}) 3) db.playfieldvalues.find({playfields: {$elemMatch:{ID:"Play.NHL.NHLEventX" ,value: {$gt: 0, $lt: 25}}}}) 4) db.playfieldvalues.find({playfields: {$elemMatch:{ID:"Play.NHL.NHLEventScoreDifferential" ,value: {$gt: 0}}}})

enter image description here enter image description here

Index created:

db.collection.ensureIndex({ "playfields.ID": 1, "playfields.value": 1 })

queries running:

1:

db.playfieldvalues.find({playfields: {$elemMatch:{ID:"Play.NHL.NHLHomeTeam" ,value: "BOS BOSTON BRUINS"}}})

2:

db.playfieldvalues.find({$and:[
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLHomeTeam" ,value: "BOS BOSTON BRUINS"}}},
  {$or:[ {playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: "NYI NEW YORK ISLANDERS"}}},{playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: "T.B TAMPA BAY LIGHTNING"}}}]},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLEventY" ,value: -38}}},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLEventX" ,value: {$gt: 0}}}},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLEventScoreDifferential" ,value: {$gt: 0}}}}
  ]
})

JSON Document sample:

{ 
    "_id" : ObjectId("59dbd4c5704aa82e70ac10b5"), 
    "playid" : "2594c658-aa3b-4a98-b2eb-0cc03e4dc9e5", 
    "playfields" : [
        {
            "ID" : "Play.NHL.NHLGameDate", 
            "TS" : "", 
            "value" : NumberInt(20160228)
        }, 
        {
            "ID" : "Play.GameDate", 
            "TS" : "", 
            "value" : "2/28/2016 12:00:00 AM"
        }, 
        {
            "ID" : "Play.NHL.NHLEventType", 
            "TS" : "", 
            "value" : "HIT"
        }, 
        {
            "ID" : "Play.NHL.NHLClockTime", 
            "TS" : "", 
            "value" : "03:08"
        }, 
        {
            "ID" : "Play.NHL.NHLClockTimeSeconds", 
            "TS" : "", 
            "value" : NumberInt(188)
        }, 
        {
            "ID" : "Play.NHL.NHLEventX", 
            "TS" : "", 
            "value" : NumberInt(62)
        }, 
        {
            "ID" : "Play.NHL.NHLEventY", 
            "TS" : "", 
            "value" : NumberInt(-38)
        }, 
        {
            "ID" : "Play.NHL.NHLEventPeriod", 
            "TS" : "", 
            "value" : "1"
        }, 
        {
            "ID" : "Play.NHL.NHLGameCode", 
            "TS" : "", 
            "value" : "20933"
        }, 
        {
            "ID" : "Play.NHL.NHLSeason", 
            "TS" : "", 
            "value" : "20152016"
        }, 
        {
            "ID" : "Play.NHL.NHLHomeTeam", 
            "TS" : "", 
            "value" : "BOS BOSTON BRUINS"
        }, 
        {
            "ID" : "Play.NHL.NHLAwayTeam", 
            "TS" : "", 
            "value" : "T.B TAMPA BAY LIGHTNING"
        }, 
        {
            "ID" : "Play.NHL.NHLPrimaryTeam", 
            "TS" : "", 
            "value" : "T.B TAMPA BAY LIGHTNING"
        }, 
        {
            "ID" : "Play.NHL.NHLPrimaryTeamActionPlayer", 
            "TS" : "", 
            "value" : "e27ca5e6-d4fa-4d45-8fa2-a860f64f7ea7"
        }, 
        {
            "ID" : "Play.NHL.NHLSecondaryTeam", 
            "TS" : "", 
            "value" : "BOS BOSTON BRUINS"
        }, 
        {
            "ID" : "Play.NHL.NHLSecondaryTeamActionPlayer", 
            "TS" : "", 
            "value" : "bea1deb6-aabd-47e8-b216-6f4df5f1ea97"
        }, 
        {
            "ID" : "Play.NHL.NHLEventZone", 
            "TS" : "", 
            "value" : "DZ"
        }, 
        {
            "ID" : "Play.NHL.NHLEventScoreDifferential", 
            "TS" : "", 
            "value" : NumberInt(1)
        }, 
        {
            "ID" : "Play.NHL.NHLEventStrength", 
            "TS" : "", 
            "value" : "Even"
        }
    ]
}

Attached explain output of second query: enter image description here enter image description here

enter image description here

user1745679
  • 209
  • 1
  • 2
  • 9
  • I was using the Mongodb for first time. I am reading articles on indexing and trying things but no breakthrough. Could anyone please advise me on this problem. – user1745679 Oct 16 '17 at 14:31
  • What version of MongoDB are you using? Are you using the WiredTiger storage engine? Also I'd add .explain() at the end of the query. Example: db.playfieldvalues.find({playfields: {$elemMatch:{ID:"Play.NHL.NHLHomeTeam" ,value: "BOS BOSTON BRUINS"}}}).explain(); If the winning plan stage is FETCH, IXSCAN it means it uses your index. You can also take a look on the documentation here https://docs.mongodb.com/manual/reference/explain-results/ – Alex P. Oct 16 '17 at 17:40
  • mongodb version: 3.4.9 WiredTiger storage engine: dont know . how to check. I downloaded mongodb installer and installed it on windows server 2012 OS. I ran explain function, I dont know how to interpret it. Will post the explain result in question – user1745679 Oct 16 '17 at 17:43
  • How do you measure the execution time? Is there a slow network in between your client and the server? – dnickless Oct 16 '17 at 19:31
  • execution time for second query 4.6 minutes. I am directly running on server. No network issues. – user1745679 Oct 16 '17 at 19:43

1 Answers1

0

The value field in embedded document stores multiple datatypes (int, string). Is this bad design?

If you ask me personally, coming from a java background, yes, it is a bad design. But with MongoDB it's not really a problem, as long as your application can handle it. FYI your index size should also be quite big you can check that by db.playfieldvalues.stats(), but that should also not matter. At least its not relevant to your performance issues.

Am i doing anything wrong in writing queries?

Well, the query is kind of complex, with a lot of conditions to be met. You can also change your query with $or operator to

db.playfieldvalues.find({
$and:[
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLHomeTeam" ,value: "BOS BOSTON BRUINS"}}},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: {$in: ["NYI NEW YORK ISLANDERS", "T.B TAMPA BAY LIGHTNING"]}}}},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLEventY" ,value: -38}}},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLEventX" ,value: {$gt: 0}}}},
  {playfields: {$elemMatch:{ID:"Play.NHL.NHLEventScoreDifferential" ,value: {$gt: 0}}}}
  ]
})

Try to simplify your query parameters as possible. Maybe even working with .limit(n) and .skip(n) is a better solution for you.

Am I missing any indexes?

Take a look at this post, although its old it might be relevant. Slow range query on a multikey index

But before you go ahead and drop and create new indexes, try to remove some parts of your query to try and locate what exactly could cause it. For example I'd remove these lines:

{playfields: {$elemMatch:{ID:"Play.NHL.NHLEventX" ,value: {$gt: 0}}}},
{playfields: {$elemMatch:{ID:"Play.NHL.NHLEventScoreDifferential"
,value: {$gt: 0}}}}

and check if the performance is still the same. Check out the documentation docs.mongodb.com/manual/core/multikey-index-bounds . Look at your indexBounds -> playfields.ID and playfields.value. I'm not sure why playfields.value: (0.0, 25.0). In the documentation it says that the bounds for the ratings: { $gte: 0 } predicate are [ [ 0, Infinity ] ]

Do I have to move data from embed documents in single document to multiple collections?

No you don't have to. There is no fixed schema, you can do as you like.

On your question in the comment on how to check if you are using wiredTiger: db.serverStatus().storageEngine. But since you are using mongo version 3.4.9 you are using wiredTiger

Alex P.
  • 3,073
  • 3
  • 22
  • 33
  • I removed range search on value field replaced with equal sign . And ran explain function. the execution time is 10.385 sec now. Total doc examined: 16036, nreturned: 2 . previous execution times are higher(the old explain output is posted in question section). I followed your links, the solution they recommend for range searches is use $lt,$gt in order. in my case I just used $gt only. – user1745679 Oct 16 '17 at 20:55
  • so the $gt is making it slower, correct? Also check out the documentation https://docs.mongodb.com/manual/core/multikey-index-bounds/ . Look at your indexBounds -> playfields.ID and playfields.value. I'm not sure why playfields.value: (0.0, 25.0). In the documentation it says that the bounds for the ratings: { $gte: 0 } predicate are [ [ 0, Infinity ] ]. Also this seems relevant: https://stackoverflow.com/questions/28361904/mongodb-degraded-query-performance. Try adding a $lt or $lte in your range queries. Does that help at all? – Alex P. Oct 16 '17 at 21:29
  • I added range search ( $lt and $gt), the explain function output says execution time: 7 sec. Lot better compared to having just $gt operator. with having range search the documents examined is 48,520. without range search (just $gt) it examined 206 K. But still 7 sec is lot time in real world for collection of 13 millions documents. – user1745679 Oct 16 '17 at 21:47
  • The below query no string searches takes 3min execution time. this query doesn't involve any range searches also. db.playfieldvalues.find({$and:[ {playfields: {$elemMatch:{ID:"Play.NHL.NHLHomeTeam" ,value: "BOS BOSTON BRUINS"}}}, {$or:[ {playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: "NYI NEW YORK ISLANDERS"}}},{playfields: {$elemMatch:{ID:"Play.NHL.NHLAwayTeam" ,value: "T.B TAMPA BAY LIGHTNING"}}}]} ] }) – user1745679 Oct 16 '17 at 21:49
  • Regarding the 7 sec. I do think 7 sec is fine. On your last comment: Run your queries with explain() so you can actually see what indexes are used. From what I can see from the image pasted on your question, notice the indexBounds for ID [Play.NHL.NHLEventX,Play.NHL.NHLEventX]. So it still isn't an index-only query. To be honest with you your query is kind of complex. To recap I'll update my answer. – Alex P. Oct 16 '17 at 22:19
  • Hi Previously when I am having performance issues. I am running mongodb on Windows 2012 server R2, 8GB RAM (3 gb assigned to sql server on same machine). Now increased memory to 64gb on machine. sql server is assigned 16 gb. The query that takes 3 to 4 min now runs in 4 sec . How much memory Mongodb requires? is there any place in mongodb to assign the memory? – user1745679 Oct 19 '17 at 17:53
  • It's not something that has an absolute answer. Actually it is very well explained in the documentation. https://docs.mongodb.com/manual/faq/diagnostics/#must-my-working-set-size-fit-ram – Alex P. Oct 19 '17 at 19:07
  • As I mentioned in the answer, you can execute .stats() to see how big your data size including index size is. – Alex P. Oct 19 '17 at 20:45
  • 2 collections: data size: 39.676840156 , index size: 4GB, storagesize: 12 GB. I have collection, index, log files in same location (C:\MongoDb) will that cause any performance issues? How to change the directories now? – user1745679 Oct 19 '17 at 20:49
  • I'm not aware that log files can cause performance issues like the once you are facing. If your log files are too big then you can see here how to rotate the logs https://docs.mongodb.com/manual/tutorial/rotate-log-files/. The RAM as you read in the documentation, hopefully, depends on a bunch of stuff. But the most important is that you have enough RAM for your indexes. Quote: "Your working set should stay in memory to achieve good performance. Otherwise many random disk IO’s will occur, and unless you are using SSD, this can be quite slow". – Alex P. Oct 20 '17 at 07:58