
The graph has billions of nodes and tens of billions of edges.

It will store webpage URLs and the links between webpages, and it will be used for testing ranking algorithms.

Any language is fine, but Java is preferred.

Solutions I have found so far:

  1. Neo4j
  2. storing in sorted flat files

Yes, I have already read "Best Way to Store/Access a Directed Graph".

Update

The data can be distributed on multiple computers and does not need to be fully in-memory.
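To make option 2 concrete, here is a minimal sketch of what I mean by sorted flat files. It assumes each URL has already been mapped to a non-negative `long` ID, and that every edge is stored as a fixed-width 16-byte record (source ID, target ID) sorted by source. The class name `SortedEdgeFile` and its methods are made up for illustration, and the in-memory sort only works for a toy input; a real data set of this size would need an external merge sort.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: a flat file of (source, target) records sorted by source.
class SortedEdgeFile {
    // Each record is two big-endian longs: source ID, then target ID.
    static final int RECORD_BYTES = 2 * Long.BYTES;

    // Sorts the edges by (source, target) and writes them as fixed-width records.
    // Assumption: the edges fit in memory, which is only true for a demo.
    static void write(File file, long[][] edges) throws IOException {
        long[][] sorted = edges.clone();
        Arrays.sort(sorted, Comparator.<long[]>comparingLong(e -> e[0])
                                      .thenComparingLong(e -> e[1]));
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (long[] e : sorted) {
                out.writeLong(e[0]);
                out.writeLong(e[1]);
            }
        }
    }

    // Returns the targets of all edges leaving `source`, in ascending order.
    static List<Long> outLinks(File file, long source) throws IOException {
        List<Long> targets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long count = raf.length() / RECORD_BYTES;
            // Binary search for the first record whose source is >= `source`.
            long lo = 0, hi = count;
            while (lo < hi) {
                long mid = (lo + hi) >>> 1;
                raf.seek(mid * RECORD_BYTES);
                if (raf.readLong() < source) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            // Read forward while the source still matches.
            raf.seek(lo * RECORD_BYTES);
            for (long i = lo; i < count; i++) {
                if (raf.readLong() != source) {
                    break;
                }
                targets.add(raf.readLong());
            }
        }
        return targets;
    }
}
```

With this layout, finding the out-links of one page costs a logarithmic number of seeks, and a ranking pass over the whole graph is a single sequential read. At 16 bytes per edge, 10 billion edges come to roughly 160 GB per file before any compression, so the files would have to live on disk and could be split by source ID range across machines.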

elhoim
  • Your question is somewhat vague. Do you actually need access to the whole dataset, or are you just visualizing the connected nodes? What I usually do, and what many in my field do, is take a broad, calculated sample of the data and then display it. This may not be accurate enough, depending on your needs. – slimbo Oct 06 '09 at 19:46
  • Do you need to keep your data in memory? If so, just forget it... Clarify your question, or consider storage outside main memory (an indexed database). – NewbiZ Oct 06 '09 at 19:55
  • @Steve: the ranking algorithm needs to scan all the links to output a value per link. So using a subset/sample does not work. – elhoim Oct 06 '09 at 20:53
  • Couldn't you just use the ranking algorithm to scan the links, link by link, and generate a new table of links and values? Then show a subset of that new table, i.e. 500 links -> new file with 500 links and 500 values -> display the 100 links that best represent the data on the graph. Running the script and displaying the data at once would be nearly impossible in RAM because of the amount of data. It would be better to process the data first and then display a representation of it. – slimbo Oct 06 '09 at 21:42
  • The ranking algorithms I am talking about cannot work on "flows" of links. They work on the complete graph at a point in time. – elhoim Oct 07 '09 at 05:38

1 Answer


Depending on your implementation, another solution could be Terracotta. I think it supports object graphs of this magnitude using a distributed virtual heap.

http://www.terracotta.org/web/display/docs/Concept+and+Architecture+Guide#ConceptandArchitectureGuide-VirtualHeap

spieden