In ArangoDB, will querying, with filters, from the neighbor(s) be done in O(n)?

Question

I've been reading Aql Graph Operation and Graphs, and have found no concrete example and performance explanation for the use case of SQL-Traverse.

E.g.:

If I have a collection Users, which has a company relation to collection Company

Collection Company has relation location to collection Location;

Collection Location is either a city, country, or region, and has relation city, country, region to itself.

Now, I would like to query all users who belong to companies in Germany or EU.

SELECT from Users where Users.company.location.city.country.name="Germany";
SELECT from Users where Users.company.location.city.parent.name="Germany";

or

SELECT from Users where Users.company.location.city.country.region.name="europe";
SELECT from Users where Users.company.location.city.parent.parent.name="europe";

Assuming that Location.name is indexed, can I have the two queries above executed with O(n), with n being the number of documents in Location (O(1) for graph traversal, O(n) for index scanning)?

Of course, I could just save regionName or countryName directly in company, as these cities and countries are in EU, unlike in ... other places, won't probably change, but what if... you know what I mean (kidding, what if I have other use cases which require constant update)

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

I'm going to explain this using the ArangoDB 2.8 Traversals.

We Create these collections to match your shema using arangosh:

db._create("countries")
db.countries.save({_key:"Germany", name: "Germany"})
db.countries.save({_key:"France", name: "France"})
db.countries.ensureHashIndex("name")

db._create("cities")
db.cities.save({_key: "Munich"})
db.cities.save({_key: "Toulouse")

db._create("company")
db.company.save({_key: "Siemens"})
db.company.save({_key: "Airbus"})

db._create("employees")
db.employees.save({lname: "Kraxlhuber", cname: "Xaver", _key: "user1"})
db.employees.save({lname: "Heilmann",  cname: "Vroni",  _key: "user2"})
db.employees.save({lname: "Leroy",  cname: "Marcel",  _key: "user3"})

db._createEdgeCollection("CityInCountry")
db._createEdgeCollection("CompanyIsInCity")
db._createEdgeCollection("WorksAtCompany")


db.CityInCountry.save("cities/Munich", "countries/Germany", {label: "beautiful South near the mountains"})
db.CityInCountry.save("cities/Toulouse", "countries/France", {label: "crowded city at the mediteranian Sea"})

db.CompanyIsInCity.save("company/Siemens", "cities/Munich", {label: "darfs ebbes gscheits sein? Oder..."})
db.CompanyIsInCity.save("company/Airbus", "cities/Toulouse", {label: "Big planes Ltd."})


db.WorksAtCompany.save("employees/user1", "company/Siemens", {employeeOfMonth: true})
db.WorksAtCompany.save("employees/user2", "company/Siemens", {veryDiligent: true})
db.WorksAtCompany.save("employees/user3", "company/Eurocopter", {veryDiligent: true})

In AQL we would write this query the other way around. We start with the constant time FILTER on the indexed attribute name and start our traversals from there on. Therefor we filter for the country "Germany":

db._explain("FOR country IN countries FILTER country.name == 'Germany' RETURN country ")
Query string:
 FOR country IN countries FILTER country.name == 'Germany' RETURN country 

Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  6   IndexNode          1     - FOR country IN countries   /* hash index scan */
  5   ReturnNode         1       - RETURN country

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields       Ranges
  6   hash   countries    false    false        66.67 %   [ `name` ]   country.`name` == "Germany"

Optimization rules applied:
 Id   RuleName
  1   use-indexes
  2   remove-filter-covered-by-index

Now that we have our well filtered start node, we do a graph traversal in reverse direction. Since we know that Employees are exactly 3 steps away from the start Vertex, and we're not interested in the path, we only return the 3rd layer:

db._query("FOR country IN countries FILTER country.name == 'Germany' FOR v IN 3 INBOUND country CityInCountry, CompanyIsInCity, WorksAtCompany  RETURN v")

[ 
  { 
    "cname" : "Xaver", 
    "lname" : "Kraxlhuber", 
    "_id" : "employees/user1", 
    "_rev" : "1286703864570", 
    "_key" : "user1" 
  }, 
  { 
    "cname" : "Vroni", 
    "lname" : "Heilmann", 
    "_id" : "employees/user2", 
    "_rev" : "1286729095930", 
    "_key" : "user2" 
  } 
]

Some words about this queries performance:

We locate Germany using a hash index is constant time -> O(1)
Based on that we want to traverse m many paths where m is the number of employees in Germany; Each of them can be traversed in constant time. -> O(m) at this step.
Return the result in constant time -> O(1)

All combined we need O(m) where we expect m to be less than n (the number of employees) as used in your SQL-Traversal.

Oh, I get it, so I should change my mindset, instead of saying "get me a list of users whose city whose country is Germany", I should ask for "take a look at Germany, follow the paths until you get to users and get me that list". Then what if I have more than 1 condition (sorry for the old mind set here) SELECT from Users where Users.company.location.city.country.name="Germany" and Users.department.parent.parent="Product development"; with "department" can be hierarchal (just like location), e.g. "Backend" -> "Web development" -> "Software development" -> "Product development"? — TruongSinh, Dec 10 '15 at 19:20
Is the "graph intersection" suitable for the above-described scenario, and is it O(m+n) where m is number if employees in Germany and n is the number of employees in "Product development" department? And is `GRAPH_COMMON_NEIGHBORS` the right function? — TruongSinh, Dec 10 '15 at 19:29
You can add [FILTER statements to the traversal part](https://docs.arangodb.com/devel/Aql/GraphTraversals.html#filter-examples) and would probably therefore want to filter for something like path.vertices[3].department == "product development" - or if that comes from another collection, you could first get that department with a `LET department=(FOR d IN departments FILTER name="Product Development" RETURN d)` — dothebart, Dec 11 '15 at 08:27
If you filter on `path.vertices[n].something` many queries will be able to prune all paths at length **n** if they do not match the filter condition. This affects the number traversed paths and reduces it. The Complexity of the filter depends on the comparison. A simple `==` will be constant time. If you use a more complex filter function, its complexity will be added. Path filters will only be executed on the exact traversal depth **n**. In General filters improve performance. You can use `db.explain` for estimates. — dothebart, Dec 11 '15 at 16:18

In ArangoDB, will querying, with filters, from the neighbor(s) be done in O(n)?

1 Answers1