4

I have a graph database which looks like this (simplified) diagram:

Graph diagram

Each unique ID has many properties, which are represented as edges from the ID to unique values of that property. Basically that means that if two ID nodes have the same email, then their has_email edges will both point to the same node. In the diagram, the two shown IDs share both a first name and a last name.

I'm having difficulty writing an efficient Gremlin query to find matching IDs, for a given set of "matching rules". A matching rule will consist of a set of properties which must all be the same for IDs to be considered to have come from the same person. The query I'm currently using to match people based on their first name, last name, and email looks like:

g.V().match(
    __.as("id").hasId("some_id"),
    __.as("id")
        .out("has_firstName")
        .in("has_firstName")
        .as("firstName"),
    __.as("id")
        .out("has_lastName")
        .in("has_lastName")
        .as("lastName"),
    __.as("id")
        .out("has_email")
        .in("has_email")
        .as("email"),
    where("firstName", eq("lastName")),
    where("firstName", eq("email")),
    where("firstName", neq("id"))
).select("firstName")

The query returns a list of IDs which match the input some_id.

When this query tries to match an ID with a particularly common first name, it becomes very, very slow. I suspect that the match step is the problem, but I've struggled to find an alternative with no luck so far.

Misha Brukman
  • 12,938
  • 4
  • 61
  • 78

1 Answers1

9

The performance of this query will depend on the edge degrees in your graph. Since many people share the same first name, you will most likely have a huge amount of edge going into a specific firstName vertex. You can make assumptions, like: there are fewer people with the same last name than people with the same first name. And of course, there should be even fewer people who share the same email address. With that knowledge you just can start to traverse to the vertices with the lowest degree first and then filter from there:

g.V().hasId("some_id").as("id").
  out("has_email").in("has_email").where(neq("id")).
  filter(out("has_lastName").where(__.in("has_lastName").as("id"))).
  filter(out("has_firstName").where(__.in("has_firstName").as("id")))

With that, the performance will mostly depend on the vertex with the lowest edge degree.

Daniel Kuppitz
  • 10,846
  • 1
  • 25
  • 34
  • 1
    This immediately resulted in drastically shorter query times, thank you! Does the order of the filters (lines 3-4) matter at all, or is it only line 2 that has an effect on query time? – Charlie Collard Nov 11 '19 at 18:11
  • I understand that the filter step knows which nodes to retain because it looks at which `lastName` is connected to the original id and only retains those ids that are connected to the shared `lastName`. The syntax just seems to be a bit confusing here, should we just understand the `.as("id")` in the filter as `.eq("id")`? – Charlie Collard Nov 11 '19 at 18:28
  • 1
    Exactly. `foo.as("x").where(bar.as("x"))` is pretty much the same as `foo.as("x").filter(bar.as("y").where("y", eq("x")))`. – Daniel Kuppitz Nov 11 '19 at 20:27
  • 1
    Regarding the order of lines 3 and 4: I'm not sure if that makes a difference as it heavily depends on how this is being executed under the hood, but in a perfect world, it should make no difference as it only verifies the existence of a specific adjacent vertex. – Daniel Kuppitz Nov 11 '19 at 20:30