We know that we can use an adjacency list or an adjacency matrix for graph algorithms. That is easy and straightforward for small graphs. But when the graph is very big, for example a social network graph, what data structure would be best for implementing traditional algorithms like shortest-path finding? An adjacency matrix or list won't work because of high memory requirements, right? What approach do social network engines use?
-
Adjacency lists are essentially optimal w.r.t. space complexity (at least asymptotically). The best you can hope for is reducing it by a constant factor using a succinct representation – Niklas B. Apr 01 '14 at 04:33
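To make the constant-factor point concrete, here is a minimal sketch (not from the thread) of a compressed sparse row (CSR) adjacency layout: instead of one list object per node, all neighbors live in a single flat array, with an offsets array marking where each node's neighbors begin. The graph data here is made up for illustration.

```python
# CSR-style adjacency: neighbors of node i are
# neighbors[offsets[i]:offsets[i+1]]. Illustrative toy graph only.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
num_nodes = 3

# Count the out-degree of each node.
degree = [0] * num_nodes
for src, _ in edges:
    degree[src] += 1

# Prefix sums give the start offset of each node's neighbor slice.
offsets = [0] * (num_nodes + 1)
for i in range(num_nodes):
    offsets[i + 1] = offsets[i] + degree[i]

# Scatter edge targets into one flat array.
neighbors = [0] * len(edges)
fill = offsets[:-1].copy()  # next free slot per node
for src, dst in edges:
    neighbors[fill[src]] = dst
    fill[src] += 1

def adj(node):
    """Neighbors of `node` as a slice of the flat array."""
    return neighbors[offsets[node]:offsets[node + 1]]

print(adj(0))  # [1, 2]
```

Two flat arrays of machine words replace per-node list objects, which is the kind of constant-factor saving the comment refers to; the asymptotic O(V + E) space is unchanged.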
-
@NiklasB., I think the OP is asking about graphs that are too big to fit inside any single data structure in ram, such as the graph of friends in Facebook. – Codie CodeMonkey Apr 01 '14 at 05:03
-
@CodieCodeMonkey Who says anything about RAM? What I said applies just as well if you store the graph on disk. I guess "Adjacency [...] list won't work because of high memory requirements" is a misconception which I wanted to resolve – Niklas B. Apr 01 '14 at 05:05
-
@NiklasB. That was my point! – Codie CodeMonkey Apr 01 '14 at 05:06
-
@CodieCodeMonkey That you can store stuff on disk instead of keeping it all in memory? That's kinda obvious I guess, but OP might certainly have some misconception about that as well. It's impossible to tell what the actual problem is here – Niklas B. Apr 01 '14 at 05:08
-
@NiklasB., Ok, I see that you meant a broader definition of adjacency list than I assumed. – Codie CodeMonkey Apr 01 '14 at 05:12
-
On a side note, modern machines with terabytes of RAM should easily be able to handle graphs in memory with even billions of edges, like Facebook friends or Twitter subscribers – Niklas B. Apr 01 '14 at 05:21
1 Answer
Adjacency lists are in use in the sources I have found. For very large data sizes you might end up either holding the data on disk or using multiple machines to solve the problem, so I suggest adding keywords such as "external memory" or Hadoop to the search. Trying Hadoop, I found some material on solving single-source shortest path via parallel breadth-first search: http://www.cs.kent.edu/~jin/Cloud12Spring/GraphAlgorithms.pptx, http://courses.cs.washington.edu/courses/cse490h/08au/lectures/algorithms.pdf, and the question "Hadoop MapReduce implementation of shortest PATH in a graph, not just the distance".
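The idea behind the MapReduce approach in those lectures can be sketched on a single machine: each round, every frontier node emits tentative distances to its neighbors (the map phase), and each receiving node keeps the minimum candidate it has seen (the reduce phase). The graph and function names here are illustrative, not from the linked material.

```python
# Single-machine simulation of MapReduce-style parallel BFS for
# unweighted single-source shortest paths. Toy graph for illustration.
from collections import defaultdict

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def bfs_rounds(graph, source):
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    frontier = {source}
    while frontier:
        # Map phase: emit (neighbor, candidate distance) pairs.
        emitted = defaultdict(list)
        for node in frontier:
            for nbr in graph[node]:
                emitted[nbr].append(dist[node] + 1)
        # Reduce phase: each node keeps its minimum candidate;
        # nodes whose distance improved form the next frontier.
        frontier = set()
        for node, candidates in emitted.items():
            best = min(candidates)
            if best < dist[node]:
                dist[node] = best
                frontier.add(node)
    return dist

print(bfs_rounds(graph, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

In a real Hadoop job, each round would be one MapReduce pass over the edge list, so the number of passes equals the graph's diameter; that is why this approach suits wide, shallow graphs like social networks.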
In addition, http://researcher.watson.ibm.com/researcher/files/us-heq/Large%20Scale%20Graph%20Processing%20with%20Apache%20Giraph.pdf does not cover shortest paths, but it is an interesting example of solving connected components using a layer on top of Hadoop that may make life easier.
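The connected-components idea in the vertex-centric (Pregel-style) model that Giraph implements can be sketched as label propagation: every vertex starts with its own id as a label, and in each superstep a vertex adopts the smallest label among itself and its neighbors until nothing changes. This is a sequential simulation with a made-up graph, not Giraph API code.

```python
# Label-propagation connected components, simulating Pregel-style
# supersteps sequentially. `adjacency` should be symmetric (undirected).
def connected_components(adjacency):
    label = {v: v for v in adjacency}  # every vertex starts as its own component
    changed = True
    while changed:
        changed = False
        for v, nbrs in adjacency.items():
            # Adopt the smallest label visible from this vertex.
            best = min([label[v]] + [label[n] for n in nbrs])
            if best < label[v]:
                label[v] = best
                changed = True
    return label

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(connected_components(graph))  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

Each vertex only ever looks at its own neighbors, which is exactly what makes the computation easy to distribute across machines in Giraph.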