3

One question regarding data fetching approach,

First Approach:

Let say I have two document

userdoc1
{
“status”:“pending”
“usertype”:“VIP”
“userid”:“123”
}

for above document let say my documentid is status::usertype [just to clarify,this document id will be unique in our case ]

userdoc2
{
“userid”:“123”,
“fname”:“abc”,
“lname”:“xyz”,
“age”:20;
“address”:“asdf”
}

for userdoc2, let say userid is my documentid

If i do a get operation i would proceed like this (here idea is to fetch data based on document id)

select userid from userdoc1 with key “pending::VIP”;

and then

select * from userdoc2 with key “123”;

Second Approach:

I have only one document

userdoc
{
“status”:“pending”
“usertype”:“VIP”
“userid”:“123”
“fname”:“abc”,
“lname”:“xyz”,
“age”:20;
“address”:“asdf”
}

Here, documentid is “status::usertype” and we have secondary index on userid

Here if get the data like this(here idea is to fetch data based on secondary index):

select * from userdoc where userd=“123”;

Could you please explain which approach will give high read performance assuming high data load with 100 of nodes in a cluster and XDCR and other factors ?

Bifrost
  • 417
  • 5
  • 23
  • 1
    I think your question needs clarification. It looks like what you really want is all the fields and values for a user in, for example, a status of pending and type of VIP. You talk about the document ID being, in that example, "pending::VIP", but that wouldn't be unique if you have more than one user that meets those criteria. That's not allowed. Also, when you say high load, it may matter whether it's read-heavy or write-heavy, or roughly even. – Hod Feb 06 '19 at 23:01
  • 1
    @Hod, it is just an example to showcase my in hand problem, in our case it is always going to be unique "pending::VIP". I just want to know which approach is more efficient in terms of load. – Bifrost Feb 07 '19 at 02:06

2 Answers2

3

Option 1 is going to have two roundtrips from the client to the server to run two cheap queries. Option 2 is going to have one roundtrip from the client to the server to run one slightly more expensive query.

I can't be completely confident without measuring, but I would bet my money on option 2. Roundtrip costs can be a bitch.

Be sure to use a proper index on userid for option 2 and use a prepared query with the userid as a parameter. That should be the fastest option.

Johan Larson
  • 1,880
  • 1
  • 11
  • 14
3

The dominant factor (as Johan Larson says in his answer) is likely to be the roundtrip count. Your first solution will have two roundtrips from the application to the cluster, while the second will have only one. There are some potential nuances, though.

An important point to note is that key/value retrievals are always going to be fastest. Those requests will go directly to nodes running the data service. With Couchbase, the clients access the node containing the data directly, not via a master-slave arrangement. In other words, you can fulfill a k/v request with a single round trip only involving the data node that has the actual document.

Using your first approach, you can avoid N1QL entirely. Just do a straight k/v get with id status::usertype, pull out the userid, then use that to do a get of the second document. You could even use the subdocument API to only return the userid.

The second approach will involve an index, and a N1QL query, so you're hitting potentially up to three different machines in your cluster. Whether this will be faster depends on topology. If your application is running alongside your cluster (meaning the network throughput/latency is similar to the intracluster times), I think the k/v approach could actually be faster. If the network latency from app to cluster is longer, the second approach is likely faster.

There's a further consideration. If the entire result is "covered" by the index you create for the query (meaning you store all the parts of the document you care about in your index) then the response can be provided entirely by the index service. This would cut the N1QL approach down to hitting the query service and the index service, which will be faster.

To go into a little more detail, your question involves data, indexing, and query. Couchbase splits these functions into separate services, meaning you can scale each capacity independently. That's also why you can be hitting three different machines with the N1QL query.

It will also depend on the nature of the data load. For example, if it's read-heavy vs. write-heavy. Write-heavy with an index will mean index updates, whereas read-heavy won't. Similarly, XDCR will be affected by read vs. write.

Hod
  • 2,236
  • 1
  • 14
  • 22