UPDATE 1

I've added the descLength and imageLength properties to allow for easier sorting. The idea is that constant(0) can be used to fill in the values for users who lack either property, and any length greater than 0 can be used to identify a user who actually has the property. The furthest this gets me is being able to order().by() only one property at a time, using a query such as:

g.V().
  order().
    by(coalesce(values('descLength'), constant(0)))

But this isn't the full solution to match what I need.


Original Post

In Amazon Neptune, I want to sort vertices based on the presence of two properties, desc and image. The ranking should be:

  • vertices that have both properties
  • vertices that have desc but not image
  • vertices that have image but not desc
  • vertices that have neither property

Consider this graph of users and their properties:

g.addV('user').property('type','person').as('u1').
  addV('user').property('type','person').property('desc', 'second person').property('descLength', 13).as('u2').
  addV('user').property('type','person').property('desc', 'third person').property('descLength', 12).property('image', 'https://www.example.com/image-3.jpeg').property('imageLength', 36).as('u3').
  addV('user').property('type','person').property('image', 'https://www.example.com/image-4.jpeg').property('imageLength', 36).as('u4')

Using the ranking order I outlined, the results should be:

  • u3 because it has both desc and image
  • u2 because it has desc but not image
  • u4 because it has image but not desc
  • u1 because it has neither desc nor image

The order().by() samples I've seen work with data like numbers and dates that can be ranked by increasing or decreasing value, but strings like URLs and free text can't be ranked that way. What's the correct way to achieve this?

Uche Ozoemena

1 Answer

This first query is not exactly what you are looking for, as it gives 'image' and 'desc' equal weight, but it provides a foundation from which variations of the query can be built to better meet your needs.

Given:

g.V().hasLabel('user').
      project('id','data').
        by(id).
        by(values('desc','image').fold()).
  order().
    by(select('data').count(local),desc)

we get

{'id': '92c04ae3-5a7f-ea4c-e74f-e7f79b44ad3a', 'data': ['third person', 'https://www.example.com/image-3.jpeg']}
{'id': 'e8c04ae3-5a7f-2cfb-cc28-cd663bd58ef9', 'data': ['second person']}
{'id': 'c8c04ae3-5a80-5707-8ba6-56554de98f33', 'data': ['https://www.example.com/image-4.jpeg']}
{'id': 'a6c04ae3-5a7e-fd0f-1197-17f3ce44595f', 'data': []}
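To see why the count alone cannot separate the two middle cases, here is a small plain-Python model of the same ordering. It is an illustration only, not Gremlin; the dictionary keys u1 to u4 simply mirror the step labels used in the question, and property_count is a hypothetical helper name.

```python
# Plain-Python model of ordering by "how many of desc/image exist".
# This only mimics the logic of the query above; it does not talk to Neptune.
users = {
    "u1": {},
    "u2": {"desc": "second person"},
    "u3": {"desc": "third person",
           "image": "https://www.example.com/image-3.jpeg"},
    "u4": {"image": "https://www.example.com/image-4.jpeg"},
}

def property_count(props):
    """Count how many of the two properties are present."""
    return sum(1 for key in ("desc", "image") if key in props)

counts = {name: property_count(props) for name, props in users.items()}

# Sort by count, highest first. u2 and u4 have the same key, so their
# relative order is decided by the input order, not by the sort key.
ranked = sorted(users, key=lambda name: counts[name], reverse=True)
print(counts)
print(ranked)
```

In this model u2 and u4 both have a count of 1, so nothing in the sort key places desc ahead of image. The same applies to the query: the relative order of those two rows is not determined by the count.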

Building on this, we can go one step further and calculate a score based on how many of the properties exist in each case. The query below gives desc a higher score than image so in the cases where they do not both exist, desc will sort higher.

g.V().hasLabel('user').
      project('id','data','score').
        by(id).
        by(values('desc','image').fold()).
        by(union(
             has('desc').constant(2),
             has('image').constant(1),
             constant(0)).
            sum()).
  order().
    by(select('score'),desc)

which yields

{'id': '92c04ae3-5a7f-ea4c-e74f-e7f79b44ad3a', 'data': ['third person', 'https://www.example.com/image-3.jpeg'], 'score': 3}
{'id': 'e8c04ae3-5a7f-2cfb-cc28-cd663bd58ef9', 'data': ['second person'], 'score': 2}
{'id': 'c8c04ae3-5a80-5707-8ba6-56554de98f33', 'data': ['https://www.example.com/image-4.jpeg'], 'score': 1}
{'id': 'a6c04ae3-5a7e-fd0f-1197-17f3ce44595f', 'data': [], 'score': 0}
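The weighting can be checked outside the database with a plain-Python model of the same arithmetic. This is only a sketch: the score function and the users dictionary are illustrative names, not part of Gremlin or Neptune, and the input is deliberately listed out of order.

```python
# Plain-Python model of the weighted score: desc counts 2, image counts 1.
# It mirrors the arithmetic of the union(...).sum() step above, nothing more.
users = {
    "u4": {"image": "https://www.example.com/image-4.jpeg"},
    "u1": {},
    "u3": {"desc": "third person",
           "image": "https://www.example.com/image-3.jpeg"},
    "u2": {"desc": "second person"},
}

def score(props):
    """Return 2 for desc plus 1 for image; 0 when neither is present."""
    return (2 if "desc" in props else 0) + (1 if "image" in props else 0)

scores = {name: score(props) for name, props in users.items()}

# Highest score first; every combination has a distinct score, so there
# are no ties and the input order does not matter.
ranked = sorted(users, key=lambda name: scores[name], reverse=True)
print(scores)
print(ranked)
```

Because the weights are 2 and 1, the four possible combinations map to the distinct scores 3, 2, 1 and 0, which is what removes the tie between "desc only" and "image only".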

UPDATED 2022-05-06 To show how to get just the ID

Taking the query above, getting just the ID from the results is as simple as adding a select('id') at the end of the query.

g.V().hasLabel('user').
      project('id','data','score').
        by(id).
        by(values('desc','image').fold()).
        by(union(
             has('desc').constant(2),
             has('image').constant(1),
             constant(0)).
            sum()).
  order().
    by(select('score'),desc).
  select('id')

However, we can also remove work that the query no longer needs to do. The 'data' projection was included mainly for demonstration purposes and is not used once only the ID is returned, so we can reduce the query to:

g.V().hasLabel('user').
      project('id','score').
        by(id).
        by(union(
             has('desc').constant(2),
             has('image').constant(1),
             constant(0)).
            sum()).
  order().
    by(select('score'),desc).
  select('id')
Kelvin Lawrence
  • Thanks! Both queries work as described. Can you think of ways to optimize them to reduce the time it takes to complete? Sometimes they run for between 15-30s, I'd like to get them consistently below 10s for a start. – Uche Ozoemena May 06 '22 at 12:46
  • Also how can I extract only the `id`s from the results? I tried appending `.id()` after the last `order().by()` but it gave me `"code":"UnsupportedOperationException","detailedMessage":"java.util.LinkedHashMap cannot be cast to org.apache.tinkerpop.gremlin.structure.Element"`. I then tried using `.values(t.id)`, `.values('id')`, and `.toList().id()` but still couldn't get it to work. – Uche Ozoemena May 06 '22 at 13:16
  • On performance, it depends on how much data you have to fetch. You could reduce some of the data being fetched and see if that helps. As written, the query looks at every vertex in the graph because it starts with `g.V()`. If you can add filters to reduce the data inspected, that will almost certainly improve performance. I will edit the answer to show how to get just the `id` back, but in the query above just doing `select('id')` will work. – Kelvin Lawrence May 06 '22 at 13:57
  • Awesome thanks! Yes that extra `select('id')` was what I needed. – Uche Ozoemena May 07 '22 at 14:18