Large graph processing on Hadoop

Question

I am working on a project that involves a random walk on a large graph. I coded it in Python using networkx, but the graph soon became too big to fit in memory, so I realised that I need to switch to a distributed system. So far, I understand the following:

I will need a graph database such as Titan or Neo4j.
I will also need a graph processing framework such as Apache Giraph on Hadoop or GraphX on Spark.

Firstly, are there enough APIs to allow me to continue to code in Python, or should I switch to Java?

Secondly, I couldn't find specific documentation on how to write a custom traversal function (in either Giraph or GraphX) to implement the random walk algorithm.

GraphX itself has no Python API, but you can do GraphX-style graph processing from Python with GraphFrames. Might be worth looking at https://graphframes.github.io/. — Binary Nerd, Jan 11 '17 at 07:55

score 0 · Answer 1 · answered Jun 04 '17 at 18:35

As I understand it, you need to process large graphs that are stored on a file system. There are various distributed graph processing frameworks for this, such as Pregel, Pregel+, GraphX, GPS (Stanford), Mizan and PowerGraph.

It is worth taking a look at these frameworks. I would also suggest considering C or C++ with an MPI implementation such as Open MPI, which can give better efficiency.

Java-based frameworks tend not to be very memory-efficient. I am not sure whether these frameworks offer Python APIs.

It is also worth reading blogs and papers that compare these frameworks before deciding which one to build on.
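To give a feel for how a random walk maps onto the vertex-centric ("think like a vertex") model that Pregel-style frameworks share, here is a minimal single-machine sketch in plain Python. It is not the API of Giraph, GraphX or any other framework: the function name `random_walk_supersteps`, the dict-of-lists graph layout and the rule that walkers stop at vertices with no outgoing edges are all assumptions made for illustration. Each round of the loop plays the role of one superstep, in which a vertex sees only the walkers delivered to it and its own edge list.

```python
import random
from collections import defaultdict


def random_walk_supersteps(adjacency, start_vertices, num_steps, seed=0):
    """Simulate Pregel-style supersteps for independent random walkers.

    adjacency      -- dict mapping vertex -> list of out-neighbours (assumed layout)
    start_vertices -- iterable with one start vertex per walker
    num_steps      -- number of supersteps (moves) to simulate
    Returns a dict mapping vertex -> number of times any walker was at it.
    """
    rng = random.Random(seed)
    visits = defaultdict(int)

    # inbox[v] = number of walkers delivered to v for the current superstep.
    inbox = defaultdict(int)
    for vertex in start_vertices:
        inbox[vertex] += 1

    for _step in range(num_steps):
        outbox = defaultdict(int)
        # "Compute" phase: every vertex handles only its own inbox and edges.
        for vertex, walkers in inbox.items():
            visits[vertex] += walkers
            neighbours = adjacency.get(vertex, [])
            if not neighbours:
                continue  # assumption: walkers stop at dead ends
            for _walker in range(walkers):
                # Forward each walker to a uniformly random neighbour.
                outbox[rng.choice(neighbours)] += 1
        # "Message delivery" phase: the outbox becomes the next inbox.
        inbox = outbox

    # Count the positions the walkers end up in after the last move.
    for vertex, walkers in inbox.items():
        visits[vertex] += walkers
    return dict(visits)


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(random_walk_supersteps(graph, ["a"] * 100, num_steps=10))
```

In a real distributed framework the inner loop body would become the per-vertex compute function and the inbox/outbox swap would be handled by the framework's message passing, so only the adjacency list of one vertex needs to be in memory at a time on each worker.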
