I have a very basic news feed modelled in IBM Graph (TitanDB backed by Cassandra) as shown below:

[Image: diagram of the graph model]

I am trying to write a query that does the following:

  1. Start at vertex USER: John.Smith
  2. Get the 15 most recent posts from the user's FRIENDS combined with his own.
  3. Check whether USER: John.Smith likes any of those posts and return the result as a simple is_liked boolean property for each post.

There are a couple of additional requirements for this query:

  • In each returned post, the properties of the posting USER should also be returned. For the sake of this question, only the avatar property is required.
  • I need to be able to paginate these results, i.e. once I have retrieved the top 15 posts, I need to be able to return the next 15, then the next, and so on.

I have no problem getting the user's friends and their LATEST_POSTS:

g.V().hasLabel("USER").has("userid", "John.Smith").both("FRIEND").out("LATEST_POST");

I have read the TinkerPop documentation but still find myself lost as to how to build on this query to meet my requirements.

Also, any commentary on this approach in terms of performance, data modelling, schema or indexing would be extremely helpful, i.e. should I expect this approach to be able to retrieve feeds in real time at scale?

Thanks in advance.

gordyr

2 Answers

For the given graph schema, the query would be something like this:

g.V().has("USER", "userid", "John.Smith").as("john").
  union(identity(), both("FRIEND")).as("user").
  out("LATEST_POST").
  flatMap(emit().repeat(out("PREVIOUS_POST")).range(page * pageSize, (page + 1) * pageSize)).as("post").
  choose(__.in("LIKED").where(eq("john")), constant(true), constant(false)).as("likedByJohn").
  select("user", "post", "likedByJohn")

But Alaa already pointed out that this approach won't scale and how you could improve your graph schema.
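
The query above uses page and pageSize without defining them. A minimal sketch of how they might be set, assuming a zero-based page index and a page size of 15 (both values are illustrative, and in practice they would probably be passed to the server as query parameters rather than hard-coded):

```groovy
// Hypothetical paging values, not part of the original answer.
page = 0        // zero-based page index: 0 is the first page
pageSize = 15   // number of posts per page

low  = page * pageSize         // inclusive lower bound given to range()
high = (page + 1) * pageSize   // exclusive upper bound given to range()
// page 0 gives range(0, 15), page 1 gives range(15, 30), and so on.
```

Note that because range() sits inside flatMap(), it appears to be applied to each user's post chain separately, so a page may contain up to pageSize posts per user rather than pageSize posts in total.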

Daniel Kuppitz
  • Thanks. This is great as it correctly answers my question as presented. But I am still confused by Alaa's suggestion. Should I be connecting each post directly to its author via an edge, as opposed to the linked-list scheme that I am currently using? And if so, does that mean that the query will have to scan through every post of every friend just to get the top 15 newsfeed items? As I mentioned, IBM Graph doesn't appear to have vertex-centric indexes, so surely this wouldn't scale either? Or am I wrong? – gordyr Oct 25 '16 at 12:29
  • Yeah, sorry, I wasn't aware of the lack of vertex-centric indexes. In that case the other schema probably won't add much value. I really have no idea why they don't support vertex-centric indexes; maybe one of their devs can comment on how they think those kinds of problems should be solved. – Daniel Kuppitz Oct 25 '16 at 13:00
  • Unfortunately, we haven't had the time to support vertex-centric indexes yet due to other priorities. From our experience working with our users, traversing large graphs hasn't been an issue, especially when we are talking about a few 100 and even a few 1000 vertexes per edge. We have users who are loading a few billion nodes and are getting decent throughput. I'd suggest you make use of mixed indexes to speed up your traversal. You could stamp latest posts with a boolean property (latest) and index it. Then you can easily write queries that use `has('latest',true)` to get back the latest posts. – Alaa Mahmoud Oct 25 '16 at 14:39

You should check the pagination recipe at http://tinkerpop.apache.org/docs/3.2.3-SNAPSHOT/recipes/#pagination. Here's a simplified way to retrieve one range/page at a time:

gremlin> g.V().hasLabel('person').range(0,2)
==>v[1]
==>v[2]
gremlin> g.V().hasLabel('person').range(2,4)
==>v[4]
==>v[6]

Regarding your model, I would avoid using a LATEST_POST edge, as you will need to update this edge every time a user adds a new post. It's better to add a timestamp property to each post; you can then sort the returned results by timestamp to get the latest posts.
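
As a rough sketch of what the timestamp approach could look like, assuming a POSTED edge from each USER to each of its POST vertices and a numeric timestamp property on POST (neither is in the original schema, and page and pageSize are assumed to be supplied by the caller):

```groovy
// Assumed schema: USER -[POSTED]-> POST, with POST.timestamp set on creation.
g.V().has("USER", "userid", "John.Smith").
  union(identity(), both("FRIEND")).as("user").   // John plus his friends
  out("POSTED").                                  // every post by those users
  order().by("timestamp", decr).                  // newest first
  range(page * pageSize, (page + 1) * pageSize).as("post").
  select("user", "post")                          // pair each post with its author
```

Here the ordering and paging apply to the combined feed rather than to each user separately. The trade-off is that order() has to see every candidate post before it can emit the first page, so the cost grows with the total number of posts.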

Alaa Mahmoud
  • Thanks Alaa. I'm aware of the range() step, but I'm not sure how I would apply it in line with my other requirements. I modelled the feeds as a linked list with a LATEST_POST node, primarily because IBM Graph doesn't appear to have vertex-centric indexes yet. I was concerned that doing as you suggest would cause the query to have to scan each post of every friend in order to collect and order them. And would this not cause a huge number of posts to be attached to each user over time, making all queries progressively slower? Perhaps I am misunderstanding something. – gordyr Oct 24 '16 at 18:37
  • Or did you mean keep the linked-list model for a user's timeline, and simply have all of them attached to each other via a POSTED edge along with a timestamp on the POST vertexes? My thinking was that I could just drop the current LATEST_POST edge and attach the new post, then attach the rest of the potentially long list to the new post. Attaching each post directly to the USER would indeed make my query simpler, but as mentioned, I'm concerned about query performance deteriorating over time as users add more posts. – gordyr Oct 24 '16 at 18:43
  • What the range step does is filter the vertexes from index x to index y. So if you have a query that returns all the POST vertexes, adding a range to it will return only a subset of them. `g.V().hasLabel("USER").has("userid", "John.Smith").both("FRIEND").out("LATEST_POST").range(0,5)` will return the first 5 posts. You can't have your Gremlin query return one set of data, then another, and so on in a single Gremlin call. What you could do is send the range as a parameter in your query so that each call returns a subset of your data. – Alaa Mahmoud Oct 24 '16 at 19:22
  • Okay, I think I understand. Does that mean there is no way to return a subset of a user's friends' posts without traversing every single post of every single friend? So if John.Smith has 100 friends, each with 1000 posts, I would have to traverse 100,000 nodes in order to retrieve the oldest posts in a user's feed? – gordyr Oct 24 '16 at 20:30
  • Right. But you should create an index for the timestamp property when you create your schema. I think a lookup on an indexed timestamp property will translate internally into an index lookup instead of a traversal of all the nodes. – Alaa Mahmoud Oct 25 '16 at 12:16
  • If you are worried about traversing all nodes, then you can use your latest_post method above. Another way to do this is to have a background process that groups posts into tiers (`today`, `this week`, `this month` and `archive`), where each user is connected to a vertex for each tier: one vertex for today, one for this week, etc. You then attach the posts to the appropriate vertex. This way, if you want the latest, you only look at posts attached to the `today` vertex. – Alaa Mahmoud Oct 25 '16 at 12:17
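
The tier idea in the last comment could be sketched roughly as follows. Every schema name here is hypothetical: a TIER edge from each USER to its tier vertices, a name property on the tier vertex, a CONTAINS edge from a tier vertex to its posts, and a timestamp property on each post.

```groovy
// Hypothetical schema: USER -[TIER]-> tier vertex (name = "today", "this week", ...)
//                      tier vertex -[CONTAINS]-> POST (with a timestamp property)
g.V().has("USER", "userid", "John.Smith").
  union(identity(), both("FRIEND")).   // John plus his friends
  out("TIER").has("name", "today").    // only each user's "today" tier vertex
  out("CONTAINS").                     // posts a background process filed there
  order().by("timestamp", decr).       // newest first
  range(0, 15)                         // first page of 15 posts
```

Only posts attached to the `today` tier are sorted, so the work is bounded by recent activity rather than by each user's full history. Reaching older pages would mean repeating the query against the next tier.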