How do I arrange Single cardinality for vertex properties imported via CSV into AWS Neptune?

Question

The Neptune documentation says that only "Set" property cardinality is supported for property data imported via CSV, which means a newly arrived property value cannot overwrite the old value of the same property on the same vertex.

For example, if the first CSV imports

~id,~label,age
Marko,person,29

and then Marko has a birthday and a second CSV imports

~id,~label,age
Marko,person,30

then the 'age' property of the 'Marko' vertex will contain both age values, which doesn't seem useful.

AWS says that collapsing Set cardinality properties to Single cardinality (keeping only the last arrived value) needs to be done with post-processing, via Gremlin traversals.

Does this mean there should be a traversal that continuously scans vertices with multiple (Set) property values and sets each property again with Single cardinality, keeping only the last value? If so, what is the optimal Gremlin query to do that?

In pseudo-Gremlin i'd imagine something like:

g.V().property(single, properties(*), _.tail())

Is there a guarantee at all that Set-cardinality properties are always listed in order of arrival?

Or am I completely on the wrong track here?

Any help would be appreciated.

Update: The best thing I was able to come up with so far is still far from a perfect solution, but it might be useful for someone in my shoes.

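Plan A: if we happen to know the property names and the order of arrival does not matter at all (we just want single cardinality on these properties), the traversal for all vertices could be something like the following. Note that order().tail() keeps the greatest value in sort order, not necessarily the most recent one.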

g.V().has(${propname}).where(property(single, ${propname}, properties(${propname}).value().order().tail() ) )

Plan B is to collect new property values under temporary property names on the same vertex (e.g. starting with _), then traverse the vertices that have such temporary property names, set the original properties to the tailed values with single cardinality, and drop the temporary properties:

g.V().has(${temp_propname}).where(property(single, ${propname}, properties(${temp_propname}).value().order().tail() ) ).properties(${temp_propname}).drop()

Plan C, which would be the coolest but unfortunately does not work, is to keep collecting property values on a dedicated vertex, with epoch timestamps as property names and the property values as their values:

g.V(${vertexid}).out('has_propnames').properties()
==>vp[1542827843->value1]
==>vp[1542827798->value2]
==>vp[1542887080->latestvalue]

and then sort the property names (keys), take the last one, and use its value to keep the main vertex property up to date with the latest value:

g.V().has(${propname}).where(out(${has_these_properties}).count().is(gt(0))).where(property(single, ${propname}, out(${has_these_properties}).properties().value(  out(${has_these_properties}).properties().keys().order().tail()  ) ) )

It looks like the parameter of the value() step must be a constant and cannot be the result of another traversal, so I could not get this working. Perhaps someone with more Gremlin experience knows a workaround.

score 2 · Accepted Answer · answered Nov 19 '19 at 12:29

AWS has recently introduced 'single' cardinality support in the CSV bulk loader: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html So rearranging property values at the Gremlin level should no longer be needed.
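As a sketch of what that looks like for the example in the question, the cardinality goes in the column header after the type. The Int type for the age column is my assumption here; the question's CSV did not declare one.

```csv
~id,~label,age:Int(single)
Marko,person,30
```

As far as I can tell from the loader documentation, whether a later load actually overwrites an existing value also depends on the updateSingleCardinalityProperties parameter of the load request: with "TRUE" the new value replaces the old one, while with the default a new value for a single-cardinality property is reported as an error. This is worth checking against the linked page before relying on it.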

Balazs David Molnar

score 0 · Answer 2 · answered Nov 20 '18 at 18:54

It would probably be more performant to read in the file from which you are bulk loading and set that property using the vertex id, rather than scanning for a vertex with multiple values for that property.

So your Gremlin update query would be as follows.

g.V(${id})
 .property(single,${key},${value})
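
If one request per vertex turns out to be too slow, one option is to chain several updates into a single traversal, so that each round trip covers a batch of CSV rows. This is only a minimal sketch: the ids and values are made up, and it assumes that every id already exists in the graph.

```groovy
// Hypothetical ids and values; in practice they would come from the rows of the CSV being loaded.
// Each mid-traversal V(id) looks up the next vertex, and property(single, ...) overwrites its value.
g.V('Marko').property(single, 'age', 30).
  V('Vadas').property(single, 'age', 28).
  V('Josh').property(single, 'age', 33).
  iterate()
```

One caveat with this shape: if any V(id) in the chain finds no vertex, the traversal produces no further traversers and the updates after that point are silently skipped, so the batch should only contain ids that are known to exist.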

As for whether Set cardinality guarantees any order, I do not know. :(

Dave Zabriskie

Thank you for your answer! The problem is that vertices in my setup arrive very fast: CSVs containing over 100,000 vertices arrive every minute (and get processed in 2-3 seconds, so that works amazingly fast), and that's only the beginning. On the other hand, I see Gremlin queries complete in the 10-1000 ms range, so I'm afraid that if I started sending a property update query for each vertex by ID, one by one, at that volume, I'd have a massive backlog in no time. – Balazs David Molnar Nov 20 '18 at 22:34
Yes, it might not keep up without some further optimization. You would think that, since they allow a distinction between single and array types in the bulk load headers, it would factor into Single vs Set. Maybe in a newer version if enough people request it. – Dave Zabriskie Nov 21 '18 at 16:17