Using SparkGraphComputer to traverse a Titan cluster throws an error

Question

I have a cluster set up with tinkerpop-3.1.1, titan-1.1.0-SNAPSHOT, spark-1.5.2 and hadoop-2.7.1, and I run this script to reproduce the error:

graph = GraphFactory.open("hadoop-gryo.properties")

graph.traversal().V().count()

graph.traversal(computer(SparkGraphComputer)).V().next()

graph = GraphFactory.open("titan-cassandra-test-spark.properties")

graph.traversal().V().count()

graph.traversal(computer(SparkGraphComputer)).V().next()

The last call produces this error:

You must set the initial output address to a Cassandra node with setInputInitialAddress
Display stack trace? [yN] y
java.lang.IllegalStateException: You must set the initial output address to a Cassandra node with setInputInitialAddress
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopElementIterator.<init>(HadoopElementIterator.java:71)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopVertexIterator.<init>(HadoopVertexIterator.java:36)
    at org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph.vertices(HadoopGraph.java:263)
    at org.apache.tinkerpop.gremlin.process.traversal.step.map.GraphStep.lambda$new$379(GraphStep.java:61)
    at org.apache.tinkerpop.gremlin.process.traversal.step.map.GraphStep.processNextStart(GraphStep.java:123)
    at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:126)
    at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:37)
    at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:157)
    at java_util_Iterator$next.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:117)
    at groovysh_evaluate.run(groovysh_evaluate:3)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.codehaus.groovy.tools.shell.Interpreter.evaluate(Interpreter.groovy:70)
    at org.codehaus.groovy.tools.shell.Groovysh.execute(Groovysh.groovy:187)
    at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:122)
    at org.codehaus.groovy.tools.shell.ShellRunner.work(ShellRunner.groovy:95)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1210)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:152)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.work(InteractiveShellRunner.groovy:124)
    at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:59)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1210)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:132)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:152)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.run(InteractiveShellRunner.groovy:83)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.apache.tinkerpop.gremlin.console.Console.<init>(Console.groovy:144)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:218)
    at org.apache.tinkerpop.gremlin.console.Console.main(Console.groovy:305)
Caused by: java.lang.UnsupportedOperationException: You must set the initial output address to a Cassandra node with setInputInitialAddress
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.validateConfiguration(AbstractColumnFamilyInputFormat.java:84)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.validateConfiguration(ColumnFamilyInputFormat.java:74)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:122)
    at com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraBinaryInputFormat.getSplits(CassandraBinaryInputFormat.java:48)
    at com.thinkaurelius.titan.hadoop.formats.util.GiraphInputFormat.getSplits(GiraphInputFormat.java:48)
    at org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopElementIterator.<init>(HadoopElementIterator.java:66)
    ... 44 more

Strangely, the hadoop-gryo.properties graph (which is admittedly local to the machine I execute on) can perform the required traversals. The error occurs only when I try to execute any traversal other than count on a Hadoop graph pointing to a Titan cluster (the config is below). Is this a bug, or am I missing a setting?

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
#gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=/test/output
####################################
# Cassandra Cluster Config         #
####################################
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.cassandra.keyspace=mindmapstest
titanmr.ioformat.conf.storage.hostname=lxd-cluster2-cassandra1,lxd-cluster2-cassandra2,lxd-cluster2-cassandra3
titanmr.ioformat.cf-name=edgestore
####################################
# SparkGraphComputer Configuration #
####################################
spark.master=spark://lxd-cluster2-cassandra1:7077
#spark.master=local[6]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
#spark.eventLog.enabled=true
####################################
# Apache Cassandra InputFormat configuration
####################################
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=mindmapstest
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
cassandra.thrift.framed.size_mb=1024
####################################
# Hadoop Cluster configuration     #
####################################
fs.defaultFS=hdfs://lxd-cluster2-cassandra1:9000

Is there a nested stack trace that shows more info on what happened in Titan to cause the `IllegalStateException`? — Jason Plurad, Aug 05 '16 at 12:09
Hi Jason, thanks for taking a look. I have amended the stacktrace in the original question. — Sheldon, Aug 05 '16 at 12:20
What if you try using just a single host on `titanmr.ioformat.conf.storage.hostname=lxd-cluster2-cassandra1`? I'd suspect that [this line](https://github.com/thinkaurelius/titan/blob/1.0.0/titan-hadoop-parent/titan-hadoop-core/src/main/java/com/thinkaurelius/titan/hadoop/formats/cassandra/CassandraBinaryInputFormat.java#L66) is not handling the comma-separated values correctly. — Jason Plurad, Aug 05 '16 at 13:10
I tried the single host, but unfortunately it didn't work. Thanks for the clue though; I will take a more in-depth look and see what I can find. — Sheldon, Aug 05 '16 at 13:56
I can see that it sets `cassandra.input.thrift.address` to lxd-cluster2-cassandra1 at that line, but it still throws the error. I will try to determine which check raises the error I am getting, to see whether it looks at a different property in the config. — Sheldon, Aug 05 '16 at 15:22
Instead of `g.V().next()`, you could do `g.V().valueMap(true).next()` or `g.V().valueMap(true).limit(1)`. — Jason Plurad, Aug 05 '16 at 21:02
Hi Jason, those two queries worked. Is this a bug, or am I missing something with respect to OLAP traversals? From the traversal, it looks like you force the valueMap to be retrieved for every vertex and then pick just one to display. Is this to force it to load the vertices from Titan into Spark? — Sheldon, Aug 08 '16 at 09:03
I was able to reproduce the behavior you were seeing. I'd say go ahead and [open up an issue](https://github.com/thinkaurelius/titan/issues) for deeper investigation/explanation. The problem only surfaces when the result of the traversal is a list of vertices (or edges). If you return something other than a vertex, like the `valueMap()` or `id()`, the same traversal steps will return without error. You can use the same approach on individual vertices `g.V(512L).valueMap(true)` or multistep traversals `g.V().out().out().valueMap(true)`. — Jason Plurad, Aug 08 '16 at 12:23
Done. You can see the issue [here](https://github.com/thinkaurelius/titan/issues/1339) — Sheldon, Aug 08 '16 at 13:21
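
The workaround discussed in the comments can be sketched as a Gremlin Console session. This is a sketch only: it assumes the same `titan-cassandra-test-spark.properties` file as in the question and the TinkerPop 3.1.x console syntax used there. The variable name `g` and the vertex ID `512L` are placeholders taken from the comments, not values known to exist in this graph. The point is that each traversal ends in a step that returns something other than a vertex or edge.

```groovy
// Gremlin Console sketch (TinkerPop 3.1.x syntax, as used in the question).
// Assumes the properties file from the question is on the console's path.
graph = GraphFactory.open("titan-cassandra-test-spark.properties")
g = graph.traversal(computer(SparkGraphComputer))

// Reported to fail: the traversal result is a vertex.
// g.V().next()

// Reported to work: the result is a map of properties or an id, not a vertex.
g.V().valueMap(true).next()
g.V().valueMap(true).limit(1)
g.V().id().next()

// Same idea for a single vertex (512L is a placeholder id) and for multistep traversals.
g.V(512L).valueMap(true)
g.V().out().out().valueMap(true)
```

Why returning vertices triggers the Cassandra input-address check is not settled in this thread; the linked issue tracks that investigation.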
