
We are trying to decide which data model is best for storing data in Couchbase when a document can grow very large. We are considering two options.

Save all the data in a single document, at the risk of reaching the maximum document size allowed by Couchbase (https://docs.couchbase.com/server/current/learn/clusters-and-availability/size-limitations.html). The document would look something like this:

{
  "id": "x",
  "name": "Test 1",
  "tokens": [
  {
    "value": "bad words 1",
    "caseSensitive": true,
    "whole": true
  },
  {
    "value": "bad words 10",
    "caseSensitive": true,
    "whole": true
  },
  [...]
  ],
  "_class": "Document"
}

Save the data in multiple documents, with each child storing the id of its parent in an indexed parentId field, so we can query ChildDocuments by parentId:

Parent:

  {
    "id": "x",
    "name": "Test 1",
    "_class": "ParentDocument"
  }

Children:

  {
    "id": "y",
    "parentId": "x",
    "value": "value",
    "caseSensitive": true,
    "whole": true,
    "_class": "ChildDocument"
  }

Our DB admins tell us that adding indexes is not good practice because of their size and performance cost, so the single-document option seems to be the only one available. What can be done to avoid reaching the maximum document size that Couchbase supports?

Thanks in advance

Matthew Groves
  • I don't understand what indexes have to do with separate document modeling, would you please expand on that? – Matthew Groves Feb 23 '22 at 14:44
  • To retrieve children by parentId, we would have to create an index on the parentId field to support a query like: SELECT * FROM bucket WHERE parentId = "x" – José Puente Fuentes Feb 23 '22 at 18:26
  • Oh, I understand. In that case, would it be feasible to store childrenIDs within the parent instead? Then you can use k/v lookups. – Matthew Groves Feb 23 '22 at 21:32
  • Thanks Matthew, that does seem like a good option! But the size limit would eventually be reached, so would we have to split the parent document into multiple documents? What would be the best way to store the relationship between those parent documents? – José Puente Fuentes Feb 24 '22 at 12:23
  • You're saying that a parent document with an array of just IDs that refer to child documents will likely grow (unbounded) to larger than 20 MB? – Matthew Groves Feb 24 '22 at 14:53
  • It is unlikely, but it could happen in the future. – José Puente Fuentes Feb 28 '22 at 10:03

1 Answer


I think this will ultimately come down to trade-offs, opinions, and maybe even benchmarks. But here's where I would start: a parent document containing an array of child document IDs. For example:

parent (id: 1)
{
   "name" : "Parent 1",
   "foo" : "bar",
   "children" : ["100","101"]
}

child (id: 100)
{
   "name" : "Child 1",
   "baz" : "qux"
}

child (id: 101)
{
   "name" : "Child 2",
   "zip" : "zap"
}

// ... and so on ...

In this case, once you fetch the parent, you can then fetch (some? all?) the children with key/value operations.
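To make the lookup pattern concrete, here is a minimal sketch in Python. It is not Couchbase SDK code: `STORE` and `kv_get` are hypothetical stand-ins for a bucket and a key/value GET, used only so the example runs on its own. The point it illustrates is that every read is by document key, so no index on the children is needed.

```python
from typing import Any

# Hypothetical in-memory stand-in for a Couchbase bucket: document key -> JSON body.
STORE: dict[str, dict[str, Any]] = {
    "1": {"name": "Parent 1", "foo": "bar", "children": ["100", "101"]},
    "100": {"name": "Child 1", "baz": "qux"},
    "101": {"name": "Child 2", "zip": "zap"},
}


def kv_get(key: str) -> dict[str, Any]:
    """Stand-in for a key/value GET; a real SDK call would go here (assumption)."""
    return STORE[key]


def fetch_children(parent_id: str) -> list[dict[str, Any]]:
    """Fetch the parent by key, then fetch each listed child by key."""
    parent = kv_get(parent_id)
    return [kv_get(child_id) for child_id in parent["children"]]


children = fetch_children("1")
print([child["name"] for child in children])
```

With a real cluster the per-child reads would probably be batched or issued concurrently, but that depends on the SDK in use.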

If the list of children becomes so obscenely long that it causes the parent document to exceed the document size limit (20 MB in Couchbase), then you could split it into multiple documents. Here is a back-of-the-envelope example:

parent (id: 1)
{
   "name" : "Parent 1",
   "foo" : "bar",
   "children" : ["100","101"],
   "childrenArchiveId" : "parent1::archive1"
}

parent archive 1 (id: parent1::archive1)
{
   "children" : ["9999998", "9999999"]
}

If there really are that many children, hopefully you don't need to fetch ALL of them (if you do, chances are you have a load of other problems). Maybe you only need the most recent or the most active ones? Once the 20 MB threshold is hit, you could move the less active ones into an "archive" document: a separate, auxiliary parent document.
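If you do need every child, collecting the IDs means following the archive pointer. The sketch below is one possible way to do that, again with a hypothetical in-memory `STORE` instead of real SDK calls. It also assumes that an archive document may itself carry a `childrenArchiveId` pointing to a further archive, which goes beyond the two-document example above.

```python
from typing import Any, Iterator

# Hypothetical in-memory stand-in for a bucket, mirroring the example documents.
STORE: dict[str, dict[str, Any]] = {
    "1": {
        "name": "Parent 1",
        "foo": "bar",
        "children": ["100", "101"],
        "childrenArchiveId": "parent1::archive1",
    },
    "parent1::archive1": {"children": ["9999998", "9999999"]},
}


def iter_child_ids(parent_id: str) -> Iterator[str]:
    """Yield child IDs from the parent, then from each chained archive document."""
    key = parent_id
    seen: set[str] = set()  # guards against an accidental cycle of archive pointers
    while key is not None and key not in seen:
        seen.add(key)
        doc = STORE[key]  # stand-in for a key/value GET (assumption)
        yield from doc.get("children", [])
        key = doc.get("childrenArchiveId")  # None when there is no further archive


print(list(iter_child_ids("1")))
```

Each archive costs one extra key/value read, so the chain stays cheap as long as the number of archive documents is small.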

But I think 20 MB allows you to store a ton of child IDs (even more if you store them as integers instead of strings, assuming that's possible).
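As a rough check of that claim, the following sketch measures how many bytes each ID costs in a compact JSON array. The assumptions are mine: seven-digit IDs like the ones in the example, 20 MB read as 20 * 1024 * 1024 bytes, and no allowance for the parent's other fields or any storage overhead.

```python
import json

LIMIT_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" taken as 20 MiB


def bytes_per_id(ids: list) -> float:
    """Average bytes per element when the list is encoded as compact JSON."""
    encoded = json.dumps(ids, separators=(",", ":")).encode("utf-8")
    return len(encoded) / len(ids)


sample = list(range(1_000_000, 1_010_000))  # 10,000 seven-digit IDs

as_strings = bytes_per_id([str(i) for i in sample])  # digits + two quotes + comma
as_ints = bytes_per_id(sample)                       # digits + comma

print(f"string IDs: {as_strings:.2f} bytes each, ~{int(LIMIT_BYTES / as_strings):,} fit")
print(f"integer IDs: {as_ints:.2f} bytes each, ~{int(LIMIT_BYTES / as_ints):,} fit")
```

Under these assumptions a single document holds roughly 2.1 million string IDs or roughly 2.6 million integer IDs before the limit becomes a concern.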

You might come up with an archival strategy that better fits your use case (maybe a new "parent" document every year, for instance?).

Matthew Groves
  • Thank you very much for such a comprehensive answer! Our problem is that we have to fetch all the children every time we check the parent, so we can't archive some of them. In our tests, saving everything in a single document performs better for our use case, but I think we could use something like the "childrenArchiveId" you mention to chain documents of less than 20 MB each that together contain all the children. – José Puente Fuentes Mar 02 '22 at 11:51
  • That's a pretty crazy scenario, having to fetch potentially thousands of children every time. Can you say more about why you need to do that? There might be a more efficient way to model it than parent-hasmany-children. – Matthew Groves Mar 02 '22 at 14:49
  • We maintain blacklists of texts, each with case-sensitive, blocking and whole properties (more could be added in the future). For example, list "x" contains "y" texts (and their properties), and the client sends a request to check whether one or more texts match any of the texts in one or more lists. – José Puente Fuentes Mar 02 '22 at 18:49
  • You might want to check out the Full Text Search (FTS) service in Couchbase; it could possibly better suit your use case. – Matthew Groves Mar 02 '22 at 19:51
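For reference, the blacklist check described in the comments above might look like the sketch below once all tokens have been loaded. It uses the token fields from the question ("value", "caseSensitive", "whole"). Reading "whole" as "match whole words only" is my assumption, as is the idea of doing the matching in application code with regular expressions; this is not FTS and not the poster's actual implementation.

```python
import re


def token_matches(token: dict, text: str) -> bool:
    """Return True if the blacklist token occurs in text under its own rules."""
    pattern = re.escape(token["value"])
    if token.get("whole", False):
        # Assumption: "whole" means the value must sit on word boundaries.
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if token.get("caseSensitive", False) else re.IGNORECASE
    return re.search(pattern, text, flags) is not None


def matching_values(tokens: list[dict], text: str) -> list[str]:
    """Values of every token in the list that matches the given text."""
    return [t["value"] for t in tokens if token_matches(t, text)]


tokens = [
    {"value": "bad words 1", "caseSensitive": True, "whole": True},
    {"value": "bad words 10", "caseSensitive": True, "whole": True},
]

print(matching_values(tokens, "this text has bad words 10 inside"))
```

Because every request scans the full token list, this only illustrates why all children are needed at once; a search service or a precompiled matcher may scale better.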