The Neptune documentation says that only "Set" property cardinality is supported for property data imported via CSV, which means a newly arrived property value cannot overwrite the old value of the same property on the same vertex.
For example, if the first CSV imports
~id,~label,age
Marko,person,29
and then Marko has a birthday and a second CSV imports
~id,~label,age
Marko,person,30
the 'age' property of the 'Marko' vertex will contain both age values, which doesn't seem useful.
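To make the behaviour concrete, here is a toy model in plain Python. It is not Neptune code and says nothing about how Neptune stores values internally; `new_graph`, `load_rows` and the dict layout are names I made up for illustration. It only shows what "Set cardinality on every import" means for the example above.

```python
from collections import defaultdict


def new_graph():
    # Hypothetical in-memory stand-in for a graph: vertex id -> label + property sets.
    return defaultdict(lambda: {"label": None, "props": defaultdict(set)})


def load_rows(graph, rows):
    """Apply CSV-like rows with Set cardinality: values accumulate, nothing is overwritten."""
    for vertex_id, label, props in rows:
        vertex = graph[vertex_id]
        vertex["label"] = label
        for key, value in props.items():
            vertex["props"][key].add(value)  # add to the set instead of replacing


graph = new_graph()
load_rows(graph, [("Marko", "person", {"age": 29})])  # first CSV
load_rows(graph, [("Marko", "person", {"age": 30})])  # second CSV, after the birthday

print(sorted(graph["Marko"]["props"]["age"]))  # [29, 30]
```

Note that a Python set carries no arrival order, so this model cannot tell which of the two values came last. Whether Neptune preserves that order is exactly the open question.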
AWS says that collapsing Set-cardinality properties to Single cardinality (keeping only the last-arrived value) needs to be done in post-processing, via Gremlin traversals.
Does this mean there should be a traversal that continuously scans vertices with multi-valued (Set) properties and sets each property again with Single cardinality, using the latest value? If so, what is the optimal Gremlin query to do that?
In pseudo-Gremlin I'd imagine something like:
g.V().property(single, properties(*), _.tail())
Is there a guarantee at all that Set-cardinality properties are always listed in order of arrival?
Or am I completely on the wrong track here?
Any help would be appreciated.
Update: The best thing I have been able to come up with so far is far from a perfect solution, but it might still be useful for someone in my shoes.
Plan A: if we happen to know the property names and the order of arrival does not matter at all (we just want single cardinality on these props), the traversal for all vertices could be something like:
g.V().has(${propname}).where(property(single, ${propname}, properties(${propname}).value().order().tail() ) )
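As a sanity check of what Plan A actually keeps, here is the same selection rule written as a small Python function. This is only a sketch of the logic, not something that talks to Neptune; `collapse_to_single` and the `props` dict (property name mapped to a set of values) are hypothetical. Since `order()` sorts ascending and `tail()` takes the last element, the surviving value is the largest one, not necessarily the one that arrived last.

```python
def collapse_to_single(props, name):
    """Keep only the largest value of a multi-valued property (mirrors order().tail())."""
    values = props.get(name)
    if not values:
        return props  # nothing to collapse for this property
    props[name] = {max(values)}  # assumes the values are mutually comparable
    return props


props = {"age": {29, 30}}
print(collapse_to_single(props, "age"))  # {'age': {30}}
```

For 'age' this happens to coincide with the latest value because ages only grow. For a property that can decrease, Plan A would pick the wrong value.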
Plan B is to collect new property values under temporary property names on the same vertex (e.g. names starting with _), then traverse the vertices that have such temporary property names and set the original properties, with single cardinality, to the tailed values:
g.V().has(${temp_propname}).where(property(single, ${propname}, properties(${temp_propname}).value().order().tail() ) ).properties(${temp_propname}).drop()
Plan C, which would be the coolest but unfortunately does not work, is to keep collecting property values in a dedicated vertex, with epoch timestamps as property names and the property values as their values:
g.V(${vertexid}).out('has_propnames').properties()
==>vp[1542827843->value1]
==>vp[1542827798->value2]
==>vp[1542887080->latestvalue]
then sort the property names (keys), take the last one, and use its value to keep the main vertex property up to date with the latest value:
g.V().has(${propname}).where(out(${has_these_properties}).count().is(gt(0))).where(property(single, ${propname}, out(${has_these_properties}).properties().value( out(${has_these_properties}).properties().keys().order().tail() ) ) )
It looks like the parameter for the value() step must be a constant; it cannot take the outcome of another traversal as its parameter, so I could not get this working. Perhaps someone with more Gremlin experience knows a workaround for this.
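One fallback I can think of for Plan C is to do the key selection on the client instead of inside the traversal: fetch the history vertex's properties as a key/value map, pick the value under the largest epoch key, and write it back with single cardinality in a second query. Below is a minimal sketch of just the selection step in Python. `latest_by_epoch` and the `history` dict are hypothetical, and the fetch and write-back queries are deliberately left out.

```python
def latest_by_epoch(history):
    """Return the value stored under the largest epoch-timestamp key."""
    if not history:
        raise ValueError("history is empty")
    # Compare keys as integers so that timestamps with different digit counts sort correctly.
    latest_key = max(history, key=int)
    return history[latest_key]


# Same data as the properties() output above, as a plain dict.
history = {
    "1542827843": "value1",
    "1542827798": "value2",
    "1542887080": "latestvalue",
}
print(latest_by_epoch(history))  # latestvalue
```

The obvious cost is a round trip per vertex, so this only makes sense if the number of vertices to fix up is small or the work can be batched.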