
I have a MongoDB database with two collections whose documents look approximately as follows:

// book document
{
    _id: $oid,
    userId: "...",
    name: "name",
    description: "description",
    status: "draft",
    // ...
}

// page document
{
    _id: $oid,
    bookId: "...",
    name: "name",
    description: "description",
    // ...
}

A book can have thousands of pages. Each book has a status property whose value is one of "draft", "published" or "archived".

A "draft" book can be created from any other book, e.g. with "draft", "published" or "archived" status.

Every time a new draft is created, a new book document is created with all the same pages as the original book.

Any "draft" book and all its pages can be edited independently of the original book and its pages. This means that any editing on the new draft book or any of its pages does not affect the original book or its pages.

In the design, priority should be given to book query performance over draft creation performance, i.e. querying a book with all its (paginated) pages is the top performance priority.

What is the best approach to achieve the requirements above? Any modification to the database schema is possible at this stage.

These are options that I considered:

  1. Clone the entire book with all its pages any time a new draft is created off an existing book. This would provide clean drafts that can be independently edited in their entirety and queried in the fastest way possible with a simple find. The downside is that the size of the page collection would grow rapidly with the number of drafts and most of the pages in cloned books would be the exact same version as in the original book. Also, draft creation time would grow proportionally with the number of pages in a book.

  2. Add a parentId property to the book document to refer to the original book and reuse all un-modified pages from the parent. Any time a page is edited, a new page is cloned with bookId set as the id of the new draft. This approach would reuse most of the page documents but querying a draft book would be much more complex as we would need to follow the parentId chain to find all modified pages and all pages belonging to all the parent books (drafts can be created multiple times off different drafts). Draft creation would be very fast (just cloning one book) but query times would grow proportionally with the number of parents of a draft. Also, when querying a draft where pages have been edited in different parents, we would need to filter out all old page versions and keep only the newest version in the chain.
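
For illustration, the query for a draft in option 2 might look roughly like this sketch, assuming a parentId field on the book document and using $graphLookup to walk the parent chain (filtering out superseded page versions would still need to happen afterwards, probably at application level):

// sketch: fetch a draft plus the pages of every book in its parentId chain
db.books.aggregate([
    { $match: { _id: draftBookId } },   // _id of the draft being queried
    // collect the draft itself and every ancestor book
    { $graphLookup: {
        from: "books",
        startWith: "$_id",
        connectFromField: "parentId",
        connectToField: "_id",
        as: "lineage",
        depthField: "depth"
    } },
    { $unwind: "$lineage" },
    // pull in the pages of each book in the lineage
    { $lookup: {
        from: "pages",
        localField: "lineage._id",
        foreignField: "bookId",
        as: "lineagePages"
    } }
    // ...then keep only the newest version of each page across the lineage
])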

Lorenzo Polidori

1 Answer

The behaviour of the 2 approaches you described is quite different. In the first (full clone) case, changes to the original pages apply to the original book only. In the second (hierarchical) case, changes to the original pages are automatically propagated to all derivative drafts. I wouldn't ask which one is more performant, as it is like comparing apples to oranges.

Consider a 3rd option (copy-on-change) - store references to the pages in a book:

// book document
{
    _id: $oid,
    userId: "...",
    name: "name",
    description: "description",
    status: "draft",
    // ...
    pages: [
        $oid1,
        $oid2,
        $oid3,
        ....
    ]
}

Creating a new draft is as easy as cloning a single book document. Querying a book with pages is a trivial lookup aggregation.
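
For example, creating a draft and running the paginated query could look roughly like the sketch below (the books and pages collection names, and the skip/limit pagination values, are assumptions on my side):

// creating a draft: clone the single book document, keeping the same page ids
const original = db.books.findOne({ _id: bookId });
delete original._id;
original.status = "draft";
db.books.insertOne(original);

// querying a draft together with one page-worth of its pages
const skip = 0, limit = 50;
db.books.aggregate([
    { $match: { _id: draftId } },
    // keep only the page ids of the requested slice before the lookup
    { $addFields: { pageIds: { $slice: ["$pages", skip, limit] } } },
    { $lookup: {
        from: "pages",
        localField: "pageIds",
        foreignField: "_id",
        as: "pages"          // note: $lookup does not guarantee array order
    } },
    { $project: { pageIds: 0 } }
])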

The most expensive operation will be changing pages, which may require a 2-phase commit if you need strong data integrity.
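
To illustrate, a page edit under this scheme could be sketched as two steps, without transactions (field and collection names are taken from the documents above):

// copy-on-change page edit: write the new page version first,
// then swap the id inside the draft's pages array
const oldPage = db.pages.findOne({ _id: oldPageId });
delete oldPage._id;
oldPage.description = "edited description";   // the actual edit
const { insertedId } = db.pages.insertOne(oldPage);

// positional update: replace the old id with the new one in this draft only
db.books.updateOne(
    { _id: draftBookId, pages: oldPageId },
    { $set: { "pages.$": insertedId } }
);

If the second write fails you are left with an orphaned page, which is where the housekeeping step mentioned below comes in.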

With this approach you can choose either to propagate changes to derivative drafts or to keep them local. If the latter, you may need an additional housekeeping step to remove orphaned pages.
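
A periodic cleanup could be as simple as the sketch below (the distinct call returns a single array, so for very large page counts you would need an aggregation-based variant instead):

// housekeeping sketch: delete pages no longer referenced by any book
const referencedIds = db.books.distinct("pages");
db.pages.deleteMany({ _id: { $nin: referencedIds } });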

Alex Blex
  • Yes, in the hierarchical case new versions of a page would need to hide the older versions in the original books. This is quite complex and might need to be done at application level. I also thought about your 3rd option. A couple of questions on that: 1. To use the lookup aggregation you need to give up on sharded collections (though I am still unsure whether I should generally care?) 2. Apart from the data integrity issue, why do you think changing a page would be expensive? Would it not just be creating a new page document and swapping the old id with the new id in the list of page ids? – Lorenzo Polidori Apr 03 '18 at 09:54
  • It's not that a new page in a draft replaces the same page in the original document - that part works in both your cases. The difference is when you add a new page to the original document. In the first case it is not added to the draft, as the draft is a full copy made before the new page was added. In the second case the new page will be in the draft as well, as the draft still references the original document. – Alex Blex Apr 03 '18 at 13:02
  • Yes, I think any new pages added to the original book would need to be filtered out by creation date. – Lorenzo Polidori Apr 03 '18 at 13:07
  • Regarding shards - it's okay to have pages sharded as long as the books collection remains unsharded. If both collections need to be sharded, you can live without the aggregation at all. Just make 2 queries - find the book, then find its pages with `{_id: {$in: book.pages.slice(start, end)}}`. Regarding expensive writes - yes, it is expensive due to the complex update of multiple documents. Swapping the old id with the new id in the list of page ids still requires some effort to ensure concurrent updates of the same book work as expected. – Alex Blex Apr 03 '18 at 13:08
  • Thanks, yes I guess handling concurrent updates will be sorted out when MongoDB v4 transactions come into the picture. For now, either 2-phase commits or weak data integrity. – Lorenzo Polidori Apr 03 '18 at 13:47