
I have a user collection with 52 million records. Each user document has a list of comments, and comment_id has a unique index on it.

{
  _id: 123,
  user_name: "xyz",
  comments: [
    {
      comment_id: 123,
      text: "sd"
    },
    {
      comment_id: 234,
      text: "sdf"
    },
    ...
    (~63,000 elements)
  ]
}

The totalIndexSize of the comment_id index is 104GB. I have around 100 documents (out of the 52M) with 63,000 elements in the comments array.

My aim is to delete the old comments and reduce the size of the comments array by over 80%. Earlier, I tried to update a document with this query:

db.user.updateOne({ _id: _id }, { $set: { comments: newCommentsArray } }, { upsert: true })

Here newCommentsArray has around 400 elements. This operation took around 130 seconds to execute.

My questions are:

1) What could be the reason the update query above took 130 seconds? Is it because of the huge unique index on the comment_id field? (I believe that replacing the comments array forces the index to remove entries for all 63,000 deleted elements and insert entries for the new ones.)

2) I had another approach: use $pull to remove 100 comments from the comments array, wait 5 seconds, and then run the next batch of 100. What do you think of this solution?

3) If the above solution is no good, can you suggest a good way to reduce the comments array by over 80%?

  • Can you please post: the list of current indexes (`db.user.getIndexes()`) and whether `_id` is an ObjectId or a numeric value? Also, where are you running the update query: the MongoDB shell or your app? – Valijon Feb 22 '20 at 18:50
  • Hey, I'm running the query from my app, using the pymongo library. These are my indexes: [ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db.user" }, { "v" : 2, "key" : { "comments.comment_id" : 1 }, "name" : "comments.comment_id_1", "ns" : "db.user" } ] – kothinti ayeshreddy Feb 22 '20 at 18:57
  • If you don't have an index on `user_id`, you are running the update with `_id: user_id` as the criteria, is that correct? – Valijon Feb 22 '20 at 19:08
  • Sorry about that, I'm querying by _id; I've updated the question. – kothinti ayeshreddy Feb 22 '20 at 19:16

1 Answer


You have a huge index for comment_id because it is a multikey index:

MongoDB creates an index key for each element in the array.

In your case, the _id index holds one entry per document (~1GB). The comments.comment_id index holds one entry per array element, on average ~100 entries per document, which is roughly 100x as many entries and explains the ~104GB.
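You can check the per-index breakdown yourself; a minimal check in the mongo shell (assuming your collection is named user):

db.user.stats().indexSizes
// returns the size of each index in bytes, e.g.
// { "_id_" : <bytes>, "comments.comment_id_1" : <bytes> }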

1) What could be the reason the update query above took 130 seconds?

MongoDB stores indexes in a B-tree structure. B-tree properties:

Algorithm   Average     Worst case
Space       O(n)        O(n)
Search      O(log n)    O(log n)
Insert      O(log n)    O(log n)
Delete      O(log n)    O(log n)

This means that, to insert or delete index entries for the comments, MongoDB must traverse O(log n) of the tree per entry (roughly 25-32 comparisons each, worst case). Your update removes ~63,000 entries from the index and inserts ~400 new ones, each paying that cost, which is why it is slow.
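A rough back-of-the-envelope estimate (the ~100 entries per document is the average implied by the numbers above, and log2 counts comparisons, not page reads; a wide-branching B-tree needs far fewer page reads):

// Rough estimate of the index maintenance work for one array replacement
var totalEntries = 52e6 * 100;            // ~5.2 billion entries in comments.comment_id_1
var stepsPerOp = Math.log2(totalEntries); // ~32 comparisons per insert/delete (upper bound)
var indexOps = 63000 + 400;               // old entries deleted + new entries inserted
indexOps * stepsPerOp;                    // ~2 million comparisons, plus the page I/O behind them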

2) I had another approach: use $pull to remove 100 comments from the comments array, wait 5 seconds, and then run the next batch of 100.

As the comments are indexed, each batch will be fast (remember the O(log n) property). It's not necessary to wait 5 seconds between batches: since MongoDB 3.0, the WiredTiger storage engine uses document-level (multi-granularity) locking, so a write locks only the affected documents.
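A minimal sketch of that batching in the mongo shell (userId and oldCommentIds are placeholders for your own values; the same calls exist in pymongo):

var batchSize = 100;
for (var i = 0; i < oldCommentIds.length; i += batchSize) {
  var batch = oldCommentIds.slice(i, i + batchSize);
  // Remove one batch of old comments; the multikey index entries
  // for the pulled elements are deleted along with them.
  db.user.updateOne(
    { _id: userId },
    { $pull: { comments: { comment_id: { $in: batch } } } }
  );
}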

Also, you may trim the array with the $push operator combined with $slice, like this:

db.user.updateOne({ _id: userId }, { $push: { comments: { $each: [], $slice: -400 } } })

This pushes zero new items ($each: []) and then keeps only the last 400 elements of the array ($slice: -400). Note that it keeps the last 400, so it matches your goal only if comments are appended in chronological order.

3) If the above solution is no good, can you suggest a good way to reduce the comments array by over 80%?

Even if you reduce the comments array, WiredTiger won't release the now-unneeded disk space back to the operating system. To reclaim it, rebuild the index or compact the collection:

Running dropIndex

db.user.dropIndex({ "comments.comment_id": 1 })

Warning: since v4.2, this obtains an exclusive lock on the specified collection for the duration of the operation. All subsequent operations on the collection must wait until db.collection.dropIndex() releases the lock.

Prior to v4.2, this command obtains a write lock on the affected database and blocks other operations until it has completed.
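If you still need the index after trimming the arrays, recreate it afterwards; a sketch with the same key as before:

db.user.createIndex({ "comments.comment_id": 1 })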

Or running compact

Warning: compact blocks operations for the database it is currently operating on. Only use compact during scheduled maintenance periods. Also, you must authenticate as a user with the compact privilege action on the target collection
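For reference, compact is issued per collection through runCommand:

db.runCommand({ compact: "user" })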
