
I want to use sharding in ArangoDB. I have set up coordinators and DBServers as described in the 2.8.5 documentation. But can someone still explain it in detail, and also how I can check the performance of my query before and after sharding?

Haseb Ansari

1 Answer


Testing your application can be done with a local cluster, where all instances run on one machine - which is what you already did, if I understand correctly?

An ArangoDB cluster consists of coordinator and dbserver nodes. Coordinators don't have their own user-specific collections on disk. Their role is to handle the I/O with the clients, and to parse, optimize and distribute the queries and the user data to the dbserver nodes. Foxx services also run on the coordinators. DBServers are the storage nodes in this setup; they keep the user data.

To compare the performance between clustered and non-clustered mode, import the same dataset into a clustered instance and a non-clustered one and compare the query execution times. Since the cluster setup can involve more network communication (i.e. if you do a join) than the single-server case, the performance can differ. On a physically distributed cluster you may achieve higher throughput, since the cluster nodes are then separate machines with their own I/O paths ending on separate physical hard disks.
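A simple way to make that comparison is to time the query from arangosh on both setups. A minimal sketch - the helper name timeQuery is mine, not an ArangoDB API, and it assumes arangosh's db object is passed in:

```javascript
// Sketch: time an AQL query from arangosh. `db` is arangosh's database
// object; `timeQuery` is a hypothetical helper, not part of ArangoDB.
function timeQuery(db, queryString) {
  var start = Date.now();
  var count = db._query(queryString).toArray().length;
  return { documents: count, ms: Date.now() - start };
}
```

Run the same query over the same dataset on the single server and on a coordinator, and compare the ms values.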

In the cluster case you create collections specifying the number of shards using the numberOfShards parameter; the shardKeys parameter controls the distribution of your documents across the shards. You should choose that key so documents distribute well across the shards (i.e. are not skewed towards just one shard). The numberOfShards can be an arbitrary value and doesn't have to correspond to the number of dbserver nodes - it could even be bigger, so you can more easily move a shard from one dbserver to a new dbserver when scaling up your cluster to more nodes in the future to adapt to higher loads.
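For example (a sketch; the collection name "customers" and the "country" shard key are hypothetical - pick an attribute with many distinct values in your data):

```javascript
// Sketch: options for creating a sharded collection.
// numberOfShards may exceed the current number of dbservers;
// documents are distributed by hashing the shardKeys attribute(s).
var options = {
  numberOfShards: 4,
  shardKeys: ["country"]  // hypothetical attribute with many distinct values
};
// In arangosh, connected to a coordinator: db._create("customers", options);
```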

When you're developing AQL queries with cluster use in mind, it's essential to use the explain command to inspect how the query is distributed across the cluster, and where filters can be deployed:

db._create("sharded", {numberOfShards: 2})
db._explain("FOR x IN sharded RETURN x")
Query string:
 FOR x IN sharded RETURN x

Execution plan:
 Id   NodeType                  Est.   Comment
  1   SingletonNode                1   * ROOT
  2   EnumerateCollectionNode      1     - FOR x IN sharded /* full collection scan */
  6   RemoteNode                   1       - REMOTE
  7   GatherNode                   1       - GATHER
  3   ReturnNode                   1       - RETURN x

Indexes used:
 none

Optimization rules applied:
 Id   RuleName
  1   scatter-in-cluster
  2   remove-unnecessary-remote-scatter

In this simple query the RETURN and GATHER nodes run on the coordinator; the nodes above them, up to and including the REMOTE node, are deployed to the DB server.

In general, fewer REMOTE / SCATTER -> GATHER pairs mean less cluster communication. And the closer FILTER nodes can be deployed to the *CollectionNodes, the fewer documents have to be sent through the REMOTE nodes, and the better the performance.
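One rough way to compare query variants is to count the REMOTE nodes in the JSON form of the explain output. A sketch, assuming the plan object shape returned by db._createStatement(query).explain().plan, whose nodes each carry a type attribute:

```javascript
// Sketch: count REMOTE nodes in an explain plan as a rough proxy for
// coordinator <-> dbserver round trips. Assumes the JSON plan shape
// from db._createStatement(query).explain().plan.
function countRemoteNodes(plan) {
  return plan.nodes.filter(function (node) {
    return node.type === "RemoteNode";
  }).length;
}
```

Fewer REMOTE nodes for the same result generally means less cluster communication.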

dothebart
  • Ok, I have set up the cluster as described above and it's working. Thank you. But I want to know: when I insert 40,000 documents in a cluster setup, how can one find out how many documents have been inserted into a particular shard? For example, I have created a cluster setup with 5 machines (all 8 GB RAM and i7 processors, approx. 3 GHz), of which one is a coordinator and the rest are dbservers. But a single simple query with a filter takes 77.77 secs to execute (the result count was 39,999 documents). How will I know how much data is distributed among my 4 shards? – Haseb Ansari Mar 17 '16 at 16:32
  • Sorry, there is currently no easy way to get statistics about the distribution of documents across the shards. As a workaround you could check the disk usage of the dbservers. I will add this as a feature request for the post-3.0 time. Regarding your query: the coordinator has to fetch all data from the DB-Server shards, parse the sub-results, build up the accumulated result in memory, and return it. In your case it has to do this for ~40k documents. We expect the situation to improve with 3.0 and VelocyPack. – dothebart Mar 18 '16 at 15:45
  • Good to hear; I will wait for the 3.x release. I would also suggest that for fulltext search, ArangoDB add the possibility of handling multiple attributes at a time in a search query, and also nested array indexes, to the Arango roadmap. I am also working on this feature for Arango and will let your team know if I succeed :) – Haseb Ansari Mar 18 '16 at 20:11