GraphX - Best way to store and compute over 3 billion vertices

Question

I am new to Spark and GraphX. So far I have been using Titan DB (with HBase storage) and Giraph for processing. I need to build a graph with ~3 billion vertices and ~5 billion edges. What would be the best way to store the graph? I want to create it from scratch by adding vertices and edges, and I also want to move away from the Titan API for graph creation. I have not been able to find any direct documentation on this. Can you suggest the best way to create, store, and process my graph using GraphX on commodity hardware?

Thanks.

The [GraphX Programming Guide](http://spark.apache.org/docs/latest/graphx-programming-guide.html) covers the creation and processing of graphs. What do you specifically want to know more about? — Daniel Darabos, Feb 06 '15 at 09:08
The Programming Guide shows how to read from an HDFS file and process it. I was just checking whether there is any reference that uses HBase to store vertices and edges and processes on top of it. Also, it would be great if an example in Java were available. — Ashok Krishnamoorthy, Feb 06 '15 at 10:11

score 2 · Answer 1 · answered Feb 06 '15 at 10:41

As long as you can read HBase tables into an RDD (which you can), there should be no issue. The HBaseTest example (it's in the Spark distribution) will probably help you further.
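To make the "rows in, graph out" step more concrete, here is a minimal sketch in plain Java of the mapping you would apply to each HBase row. It has no Spark or HBase dependency. The row layout (row key = source vertex id, cell value = comma-separated destination ids), the class name `AdjacencyRowsToEdges`, and both helper methods are assumptions made up for illustration, not anything prescribed by GraphX or HBase. In a real job the rows would come from `newAPIHadoopRDD` with HBase's `TableInputFormat`, as the HBaseTest example does, and the resulting pairs would become GraphX `Edge` objects passed to something like `Graph.fromEdges`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AdjacencyRowsToEdges {

    // Assumed (hypothetical) row layout: key = source vertex id,
    // value = comma-separated destination vertex ids, possibly empty.
    public static List<long[]> toEdges(Map<Long, String> rows) {
        List<long[]> edges = new ArrayList<>();
        for (Map.Entry<Long, String> row : rows.entrySet()) {
            String value = row.getValue().trim();
            if (value.isEmpty()) {
                continue; // vertex with no outgoing edges
            }
            for (String dst : value.split(",")) {
                // Each edge is a {sourceId, destinationId} pair.
                edges.add(new long[] {row.getKey(), Long.parseLong(dst.trim())});
            }
        }
        return edges;
    }

    // Collects every vertex id seen as a row key or as an edge endpoint.
    public static Set<Long> toVertexIds(Map<Long, String> rows, List<long[]> edges) {
        Set<Long> ids = new TreeSet<>(rows.keySet());
        for (long[] edge : edges) {
            ids.add(edge[0]);
            ids.add(edge[1]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the result of an HBase table scan.
        Map<Long, String> rows = new LinkedHashMap<>();
        rows.put(1L, "2,3");
        rows.put(2L, "3");
        rows.put(3L, "");

        List<long[]> edges = toEdges(rows);
        Set<Long> vertices = toVertexIds(rows, edges);
        System.out.println(vertices.size() + " vertices, " + edges.size() + " edges");
    }
}
```

Because the per-row logic is a pure function of one row, it can run independently on each partition, which is what lets the same idea scale across a cluster rather than a single machine.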

Sietse
