I used the linked guide, Setup spark cluster and titan and cassandra, to set up my topology. My topology is as follows:
VMs: 3 in total, each with 8 cores and 16 GB RAM.
The following is the topology of the VMs, along with the components running on each:
Note: In the below set of diagrams, "master" is the same as "IP X".
Note that, as per the linked post, HDFS isn't required by JanusGraph 0.2.0, since it uses TinkerPop 3.2.6, which removed the use of HDFS as intermediate storage.
Now, if my understanding is correct, when I push data to my Cassandra cluster using JanusGraph, I should keep the replication factor at 3 so that the data is replicated across all Cassandra nodes, which the Spark workers can then operate on.
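For reference, a quick way to confirm the replication is to describe the keyspace in cqlsh; with replication-factor=3 and JanusGraph's default strategy, the output should look roughly like the sketch below (the exact strategy class and options may differ in your setup):

cqlsh> DESCRIBE KEYSPACE testDev;

CREATE KEYSPACE testDev WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;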
Taking that as given, I pushed data into JanusGraph backed by clustered Cassandra + Elasticsearch using the below properties file:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cassandrathrift
storage.hostname=IP A, IP B, IP C
storage.cassandra.keyspace=testDev
storage.cassandra.replication-factor=3
index.search.backend=elasticsearch
index.search.hostname=IP A, IP B, IP C
index.search.elasticsearch.client-only=true
The data was pushed successfully, and I did the following checks to verify it:
cqlsh shows a keyspace with the name testDev
when using the same properties file and running an OLTP-based g.V().count(), the correct count is returned (see the console snippet below)
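For completeness, the second check was along these lines in the Gremlin Console (the properties path is illustrative):

graph = GraphFactory.open("conf/janusgraph-cassandra-es.properties")
g = graph.traversal()
g.V().count()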
Now, I want to introduce SparkGraphComputer into the mix. I had earlier tested a local Spark instance doing OLAP against Cassandra + Elasticsearch, all hosted locally in a single VM (see the one-line config note below). It worked perfectly, though it was painstakingly slow for my test data, which is okay. But when I introduce the Cassandra cluster into the mix, my Spark job/tasks simply don't start (as inferred from the UI).
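For that local test, the only Spark-related difference was pointing spark.master at a local executor instead of the standalone cluster, something like this (the core count in brackets is illustrative):

spark.master=local[*]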
The following is the properties file I use to create a Hadoop graph from the Cassandra backend and run OLAP on it:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
#
# JanusGraph Cassandra InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=IP A, IP B, IP C
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=testDev
janusgraphmr.ioformat.cf-name=edgestore
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=testDev
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
#
# SparkGraphComputer Configuration
#
spark.master=spark://IP X:7077
spark.executor.cores=3
spark.executor.memory=6g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executorEnv.HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
spark.executorEnv.SPARK_CONF_DIR=/home/spark/spark/conf
spark.driverEnv.HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
spark.driverEnv.SPARK_CONF_DIR=/home/spark/spark/conf
spark.driver.extraLibraryPath=/home/hadoop/hadoop/lib/native
spark.executor.extraLibraryPath=/home/hadoop/hadoop/lib/native
gremlin.spark.persistContext=true
# Default Graph Computer
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
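The OLAP traversal itself is run the same way as in the file-based case shown further below, roughly (the properties path is illustrative):

graph = GraphFactory.open("conf/hadoop-cassandra-olap.properties")
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()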
Following posts from here and here, it looks like my configuration (.properties file) is okay, but my job is lost every time I run any OLAP query. I don't see any tasks starting in the UI, and after a really long time, I get the error stack trace.
I initially thought it was some error in the way I set up the Spark standalone cluster, but then I tried doing OLAP by reading the graph from the file system.
I read a GraphSON file using the following properties:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=/opt/JanusGraph/0.2.0/data/grateful-dead.json
gremlin.hadoop.outputLocation=output
#
# SparkGraphComputer Configuration
#
spark.master=spark://IP X:7077
spark.executor.cores=2
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
gremlin.spark.persistContext=true
# Default Graph Computer
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
When I load the above properties file as:
graph = GraphFactory.open("conf.properties")
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
It returns the expected output, and I can see the Spark job starting in the UI, along with its stages. That effectively means the Spark OLAP query ran successfully.
If that is the case, it looks like the connection to Spark is established, but I'm unable to read data from the underlying Cassandra nodes. Why is that happening?
Any directions will be greatly appreciated!
If any more information is needed, let me know and I'll add it here.
Cheers :-)