8

I'm building an ArangoDB edge collection that consists of many "types". By type, think animal species taxonomy.

I will be building a graph that connects all of these. Example: parent/child of ancient homo species: Homo habilis->Homo floresiensis->Homo erectus->Homo sapiens

Putting the different types in different collections would only be for superficial organizational reasons. There's a small possibility that it would be useful in the future for features I haven't thought of yet.

My specific question is: Does building graphs in ArangoDB that use multiple collections take a performance hit? Will using one large collection be more efficient for graphs?

Answering the first comment: If I break this out into different edge collections, it would be 4 collections with about 300,000 rows in each. A type can have multiple parents and children. The queries would be shortest path and any connectedness between types, if that makes sense. A "six degrees of Kevin Bacon" type of thing.

EDIT: Please see the comments for some questions and answers. Almost every single query will span multiple types. Many queries will be 5-7 vertices deep. This project will almost exclusively be READING... I'm not worried about write speed at all.

EDIT 2: Will I be using a single instance or a distributed cluster? Honestly, either! Whatever will speed up reads. You tell me.

Chemdream
  • The answer will probably depend on the types of queries you will be running. Could you be more specific about that, and also tell us how many different types of edge collections you envision? You only gave one example (parent/child). It might also be helpful to know how many node collections you expect, and roughly how many nodes? – peak Jan 21 '18 at 17:50
  • Thanks. I updated my question with more details. – Chemdream Jan 21 '18 at 23:23
  • Will single queries typically span multiple edge collections? Could you give an example of a second edge collection, as well as an example of a query that DOES span multiple edge collections? – peak Jan 22 '18 at 03:31
  • Almost every single query would span multiple data collections but only a single edge collection. – Chemdream Feb 01 '18 at 12:40

2 Answers

5

In the single server setup, using multiple collections does not have any penalty. Especially if your query does not span all edge collections, it will be faster to perform lookups on smaller collections.

How much faster or slower this will be depends on the storage engine (RocksDB or MMFiles). Given that you want maximum read performance, MMFiles will likely be faster.

Simon Grätzer
  • Simon, With a multi-server setup, when would multi server clusters increase speed? Reading on Arango's site, it seems like it actually slows things down because of network latency. – Chemdream Apr 10 '18 at 15:41
  • Also, to clarify, you are saying that "if you are on a single server, using multiple collections should increase your speed"? – Chemdream Apr 10 '18 at 15:42
  • The multi-server setup will increase performance, when your queries are executed in parallel on multiple machines. It also allows you to scale your DB, if your data does not fit on one machine anymore. – Simon Grätzer Apr 12 '18 at 11:51
  • Using smaller collections may be a little faster than doing lookups in larger collections. Using multiple collections for different types of things mainly pays off, when it allows you to avoid adding `FILTER` statements in queries. I.e. you put every type of object in a different collection, instead of having a `type` attribute. – Simon Grätzer Apr 12 '18 at 12:00
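
To make the `FILTER` point above concrete, here is a minimal sketch that builds the two query shapes as AQL text from Python. The collection names (`all_edges`, `parent_of`, `hybrid_of`), the `type` attribute, and the helper function names are all hypothetical and only there for illustration; nothing here talks to a real database.

```python
# Sketch only: collection names, the `type` attribute and these helpers
# are hypothetical. The functions just build AQL text; they run no query.

def traversal_with_type_filter(edge_collection, max_depth=7):
    """One big edge collection; every edge carries a `type` attribute."""
    return (
        f"FOR v, e, p IN 1..{max_depth} OUTBOUND @start {edge_collection}\n"
        "  FILTER p.edges[*].type ALL == @type\n"
        "  RETURN v"
    )


def traversal_per_type_collections(edge_collections, max_depth=7):
    """One edge collection per type; the query names only those it needs."""
    names = ", ".join(edge_collections)
    return (
        f"FOR v IN 1..{max_depth} OUTBOUND @start {names}\n"
        "  RETURN v"
    )


single = traversal_with_type_filter("all_edges")
split = traversal_per_type_collections(["parent_of", "hybrid_of"])
print(single)
print(split)
```

In the first shape the type restriction has to be expressed as a `FILTER` on the edges of each path; in the second it is expressed simply by which edge collections are listed in the traversal.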
3

I've got a taxonomy project in ArangoDB that seems roughly equivalent in terms of the data record count that you report.

This amount of data presents no performance challenges to ArangoDB. I've chosen to focus on modeling the relationships to best represent the dataset and have not regretted this.

In your example, I'd probably have one collection for the species nodes and start with a single 'begats' edge collection to capture the species evolution pathways.

If there are multiple schools of thought, multiple classifications, or other frameworks that describe alternate pathing between the species then I'd be looking at capturing each in a different edge collection.

For example, if one taxonomy pathing is arrived at by jaw shape, another always uses the pelvis, countryX has another method, and another is DNA based, it could be instructive to dedicate an edge collection to each. You'd be creating alternative interconnect networks using exactly, or mostly, the same set of species nodes.
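
As a rough illustration of that idea, the sketch below uses a plain in-memory Python stand-in, not ArangoDB itself. It keeps one shared set of species names and one edge list per classification, then runs a breadth-first shortest path over whichever edge lists you select. The `begats` chain is the one from the question; the `by_jaw_shape` edges and all function and variable names are made up for the example.

```python
from collections import deque

# Hypothetical in-memory stand-in for edge collections over shared nodes.
EDGE_COLLECTIONS = {
    # Chain taken from the question.
    "begats": [
        ("Homo habilis", "Homo floresiensis"),
        ("Homo floresiensis", "Homo erectus"),
        ("Homo erectus", "Homo sapiens"),
    ],
    # Invented alternative classification, for illustration only.
    "by_jaw_shape": [
        ("Homo habilis", "Homo erectus"),
        ("Homo erectus", "Homo sapiens"),
    ],
}


def shortest_path(start, goal, collections):
    """Breadth-first search over the union of the chosen edge lists."""
    adjacency = {}
    for name in collections:
        for src, dst in EDGE_COLLECTIONS[name]:
            adjacency.setdefault(src, []).append(dst)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route through the selected edge lists


print(shortest_path("Homo habilis", "Homo sapiens", ["begats"]))
print(shortest_path("Homo habilis", "Homo sapiens", ["by_jaw_shape"]))
```

The same pair of species gets a different path depending on which edge list is selected, which is the kind of comparison that separate edge collections make easy to express.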

Species taxonomy isn't my field and the examples are probably nonsense. But I'd suggest not missing the opportunity to structure the data in the most useful way. The performance will very likely not be an issue.