I am working on a project that involves a random walk on a large graph. I coded it in Python using networkx, but the graph soon became too big to fit in memory, so I realised that I need to switch to a distributed system. As I understand it, I will need the following:

  1. A graph database (Titan, Neo4j, etc.)
  2. A graph processing framework, such as Apache Giraph on Hadoop or GraphX on Spark.

First, do these systems have Python APIs good enough for me to continue coding in Python, or should I switch to Java?

Second, I couldn't find clear documentation on how to write a custom traversal function (in either Giraph or GraphX) to implement the random walk algorithm.

Aneesh Makala

1 Answer

My understanding is that you need to process large graphs that are stored on a file system. There are various distributed graph processing frameworks for this, such as Pregel, Pregel+, GraphX, GPS (Stanford), Mizan, and PowerGraph.

It is worth taking a look at these frameworks. I would suggest coding in C or C++ with Open MPI, which can help achieve better efficiency.

Java-based frameworks are not very memory efficient. I am not sure whether these frameworks offer Python APIs.

It is also worth reading blogs and papers that give a comparative analysis of these frameworks before deciding which one to build on.
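To give a rough idea of how a random walk maps onto the vertex-centric ("think like a vertex") model that Pregel-style frameworks use, here is a minimal single-machine sketch in plain Python. It is not Giraph or GraphX code. The function name `pregel_random_walks`, the adjacency-dict input, and the rule that a walker stays put at a dead end are all assumptions made for illustration. Each loop iteration stands in for one superstep: every vertex looks at the walkers it received and forwards each one to a randomly chosen neighbour.

```python
import random
from collections import defaultdict


def pregel_random_walks(adjacency, start_vertices, num_steps, seed=0):
    """Simulate vertex-centric random walks in synchronous supersteps.

    adjacency      -- dict mapping vertex -> list of out-neighbours (assumed format)
    start_vertices -- one starting vertex per walker
    num_steps      -- number of supersteps to run
    Returns a dict mapping walker id -> list of visited vertices.
    """
    rng = random.Random(seed)

    # inbox[v] holds the ids of the walkers currently sitting at vertex v.
    inbox = defaultdict(list)
    paths = {}
    for walker_id, vertex in enumerate(start_vertices):
        inbox[vertex].append(walker_id)
        paths[walker_id] = [vertex]

    for _ in range(num_steps):
        # One superstep: each vertex only uses its own neighbour list
        # and the "messages" (walkers) that arrived in the previous step.
        outbox = defaultdict(list)
        for vertex, walkers in inbox.items():
            neighbours = adjacency.get(vertex, [])
            for walker_id in walkers:
                # Assumption: a walker at a dead end stays where it is.
                nxt = rng.choice(neighbours) if neighbours else vertex
                outbox[nxt].append(walker_id)
                paths[walker_id].append(nxt)
        inbox = outbox

    return paths


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for walker, path in pregel_random_walks(graph, ["a", "b"], 5).items():
        print(walker, path)
```

In a real distributed framework, the body of the `for vertex, walkers` loop would roughly correspond to the per-vertex compute function, and `outbox` to the messages sent to neighbouring vertices, so the graph never has to sit in one machine's memory.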

user1211