I have a user search feature in my app where the searcher may not want to see some results. They do this by "blocking" a tag: when a tag is blocked, all users "subscribed" to that tag are excluded from the searcher's results.

I'm writing the query to filter the search results, and I found two ways of getting the same result:

First:

g.V(1991)
.out("blocked").fold().as("blockedTags")
.V().hasLabel("user")
.not(
    where(
        out("subscribed").where(
            within("blockedTags")
        )
    )
)

Second:

g.V(1991).as("user")
.V().hasLabel("user")
.not(
    where(
        out("subscribed")
        .in("blocked")
        .as("user")
    )
)

Gremlify: https://gremlify.com/xnqhvtzo6b

One uses within() and the other performs two steps, out() and in(). I want to know which one is faster so I can decide which to use, since both options come up in many queries in my application.

EDIT:

I ran both queries in the Gremlin Console with a profile() step at the end, but the >TOTAL field gives seemingly random times, anywhere from 0.300ms to 1.220ms, for both queries. Because of this, I don't know how to compare the performance of the two queries.

fermmm
  • A good way to analyze such questions is to profile the queries. Have you tried sending the queries to the /status endpoint and observing any differences? – Kelvin Lawrence Oct 27 '20 at 23:44
  • I'm not running my code on Neptune yet; I'm still developing against gremlin-server on localhost. I edited the question with my results after running the profile() step in the Gremlin Console – fermmm Oct 28 '20 at 00:14
  • OK and I meant /profile above and not /status – Kelvin Lawrence Oct 28 '20 at 00:22
  • I edited my question again because, after executing profile() many times, it gives random execution times for both queries – fermmm Oct 28 '20 at 00:47
  • From the tags on the question it looks like you are using Amazon Neptune. Neptune has a /profile endpoint that you can send a query to using curl or from the Neptune notebooks (workbench). That will give more insight into how the query planner processed the query. In general that is a good way to get insight into how a query is performing. – Kelvin Lawrence Oct 28 '20 at 13:23
  • My plan is to use Neptune when development is finished. I hope the profile endpoint does not give different numbers each time it is executed, like the profile() step does – fermmm Oct 28 '20 at 16:22

1 Answer

I will offer a general answer here that is largely derived from the comments on the question itself. It really isn't possible to profile() one graph and then project those results on another. They will each have different capabilities and performance characteristics. If you need to know which of two approaches to a query is better, then you must test both traversals on the graph system you intend to target.

I'd also be wary of going too far in a particular development direction without doing ongoing testing on the target graph. Just as you wouldn't do all your development on MySQL only to switch to Oracle when it was time to go to production, you really shouldn't try to build your entire application against a graph you don't intend to use. There are subtle differences between these systems that could make a significant difference to you.

As to the differences in profile() times on TinkerGraph, there are bound to be timing variations on the JVM for what I'm guessing is a test on a small dataset that resides in memory. Or perhaps, for TinkerGraph, there is no significant difference between the two approaches. Consider executing each query a few thousand times, averaging the time taken, and comparing the averages. Gremlin Console has a clock() function that helps with that. Of course, as I alluded to earlier, what you learn there is no guarantee that you have the right solution on Neptune.
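A minimal sketch of that averaging approach in the Gremlin Console, assuming g is already bound to the TinkerGraph under test and that vertex 1991 exists as in the question. The closure names withinQuery and outInQuery are made up for illustration, and the loop count of 5000 is arbitrary:

```groovy
// Gremlin Console session (assumes `g` is bound to the graph under test).
// iterate() drains each traversal so the full query actually executes.

// First approach from the question: collect blocked tags once, then filter with within()
withinQuery = {
    g.V(1991).out("blocked").fold().as("blockedTags").
      V().hasLabel("user").
      not(where(out("subscribed").where(within("blockedTags")))).
      iterate()
}

// Second approach from the question: traverse out("subscribed") then in("blocked")
outInQuery = {
    g.V(1991).as("user").
      V().hasLabel("user").
      not(where(out("subscribed").in("blocked").as("user"))).
      iterate()
}

// clock(n) { ... } runs the closure n times and returns the average time in milliseconds
clock(5000) { withinQuery() }
clock(5000) { outInQuery() }
```

Running each clock() call several times and checking that the averages settle gives a rough sense of whether the gap between the two is real or just noise.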

If you'd like a bit of analysis of your queries, I can offer a few words (though I don't base this thinking on Neptune specifically). How each performs depends a lot on your graph structure, but I'd expect the first query to be faster because it captures the "blocked" vertices with:

.out("blocked").fold()

and reuses it over and over for however many V().hasLabel('user') there are. That's just a gut feeling though. I'm guessing the blocked list will be relatively small for a single user, so traversing the opposite way with:

out("subscribed").in("blocked")

would just be more expensive as you would have to traverse a lot more "blocked" edges that don't terminate with the initial vertex.

stephen mallette