
I have loaded a very large graph into Titan 1.0.0 with a Cassandra 2.1.13 backend. I need to perform some operations on the graph using Spark.

For example,

  1. I want to find subgraphs in a very large graph using Apache Spark.
  2. I want to run some clustering (machine learning code) on the graph stored in Titan, etc.

Basically, I will be applying some algorithms to a TitanGraph using Spark (which I suppose will be faster on a big graph).

I am not able to find any documentation on how to process the graph this way. Is Spark the right approach for applying algorithms (machine learning) to a large graph? What should my next steps be? How do I run my Spark code on Titan? (I cannot find the exact methods or functions through which I should be inserting or using Spark code.)

Any help is appreciated.

Amnesiac

2 Answers


Have you had a look at SparkGraphComputer? It lets you write Gremlin traversals that are executed on the Spark framework. Have a look at this example:

gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal(computer(SparkGraphComputer))
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> :remote connect tinkerpop.hadoop graph g
==>useTraversalSource=graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
==>useSugar=false
gremlin> :> g.V().group().by{it.value('name')[1]}.by('name')
==>[a:[marko, vadas], e:[peter], i:[ripple], o:[josh, lop]]
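
To make the result above easier to read: the traversal groups every vertex by the second character of its `name` property and collects the names in each group. A minimal plain-Python sketch of the same grouping follows. It is only an illustration, not Gremlin or Spark code; the `names` list is assumed from the output above, and `group_by_second_letter` is a hypothetical helper name.

```python
from collections import defaultdict

# Vertex names assumed from the console output above.
names = ["marko", "vadas", "lop", "josh", "ripple", "peter"]


def group_by_second_letter(values):
    """Group strings by their second character (index 1)."""
    groups = defaultdict(list)
    for value in values:
        groups[value[1]].append(value)
    return dict(groups)


groups = group_by_second_letter(names)
for key in sorted(groups):
    print(key, sorted(groups[key]))
```

With SparkGraphComputer the same grouping is computed in a distributed way over the vertices, so the order of names inside each group is not something to rely on.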

Another option is to use the GraphComputer API directly. It runs OLAP-style vertex programs over the whole graph and can be backed by Spark/Hadoop. Here is an example:

gremlin> result = graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
==>result[tinkergraph[vertices:6 edges:0],memory[size:0]]
gremlin> result.memory().runtime
==>95
gremlin> g = result.graph().traversal(standard())
==>graphtraversalsource[tinkergraph[vertices:6 edges:0], standard]
gremlin> g.V().valueMap('name',PageRankVertexProgram.PAGE_RANK)
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[marko]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[vadas]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.4018125], name:[lop]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[josh]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.23181250000000003], name:[ripple]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[peter]]
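
For intuition about what the vertex program computes, here is a minimal plain-Python sketch of an iterative PageRank. It is not the TinkerPop implementation. It assumes the toy graph is TinkerPop's "modern" sample graph (the edge list below is that assumption) and uses the update rule rank = (1 - alpha) + alpha * sum(rank of source / out-degree of source) with alpha = 0.85, which is also an assumption, chosen because it is consistent with the values printed above.

```python
# Assumed edges (source -> target) of TinkerPop's "modern" toy graph.
edges = [
    ("marko", "vadas"), ("marko", "josh"), ("marko", "lop"),
    ("josh", "ripple"), ("josh", "lop"),
    ("peter", "lop"),
]
vertices = ["marko", "vadas", "lop", "josh", "ripple", "peter"]


def page_rank(vertices, edges, alpha=0.85, iterations=30):
    """Iterate rank = (1 - alpha) + alpha * sum(incoming rank shares)."""
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Each vertex splits its current rank evenly over its outgoing edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_degree[src]
        rank = {v: (1 - alpha) + alpha * incoming[v] for v in vertices}
    return rank


ranks = page_rank(vertices, edges)
for name in vertices:
    print(name, ranks[name])
```

In a vertex program each vertex performs this update locally and exchanges its rank shares as messages along edges, which is what allows the computation to be distributed by a GraphComputer implementation such as SparkGraphComputer.
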
Mohamed Taher Alrefaie

Consider using Mizo for OLAP on Titan with Spark; this answer might be helpful.

Community
imriqwe