
I have implemented a local clustering coefficient algorithm based on the MapReduce paradigm. However, I have run into serious trouble with bigger datasets, or with specific datasets (high average node degree). I tried tuning my Hadoop platform and the code, but the results were unsatisfactory (to say the least). Now I have turned my attention to actually changing/improving the algorithm. Below is my current algorithm (pseudocode):

foreach(Node in Graph) {
  //Job1
  /* Transform edge-based input dataset to node-based dataset */

  //Job2
  map() {
   foreach(neighbour in this.Node.neighbours)
     emit(neighbour, this.Node)     //send my data (node ID + restricted edge list) to every neighbour
   emit(this.Node, this.Node)       //send my data to myself
  }

  reduce() {
    NodeNeighbourhood nodeNeighbourhood;
    while(values.hasNext) {
      value = values.next
      if(value is the central node itself)
        nodeNeighbourhood.setCentralNode(value)  //store my own data
      else
        nodeNeighbourhood.addNeighbour(value)    //store neighbour data
    }

    emit(null, nodeNeighbourhood)
  }

  //Job3
  map() {
    float lcc = calculateLocalCC(this.nodeNeighbourhood)
    emit(0, lcc) //emit all lcc to specific key, combiners are used
  }

  reduce() {
    float combinedLCC = 0;
    long numberOfNodes = ...; //total node count, e.g. taken from a counter of an earlier job
    while(values.hasNext) {
      combinedLCC += values.next;
    }

    emit(null, combinedLCC/numberOfNodes); //store the graph's average local clustering coefficient
  }
}

A little more detail about the code: for directed graphs the neighbour data is restricted to the node ID and the destination IDs of its OUT edges (to decrease the data size); for undirected graphs it is likewise the node ID and the destination IDs of its edges. The sort and merge buffers are increased to 1.5 GB, with 80 merge streams.
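To make Job3's calculateLocalCC concrete, this is roughly what it does for the undirected case. It is a simplified, self-contained Java sketch, not my actual Hadoop code: the Set/Map parameters stand in for my NodeNeighbourhood structure and the names are purely illustrative.

import java.util.Map;
import java.util.Set;

public class LocalCC {

    // Undirected local clustering coefficient:
    // lcc(v) = 2 * |edges among neighbours of v| / (k * (k - 1)), where k = deg(v).
    // 'neighbourAdjacency' maps each neighbour of the central node to that
    // neighbour's own (restricted) adjacency set.
    static float calculateLocalCC(Set<Long> centralNeighbours,
                                  Map<Long, Set<Long>> neighbourAdjacency) {
        int k = centralNeighbours.size();
        if (k < 2) {
            return 0f; // fewer than two neighbours: no triangle is possible
        }
        long links = 0;
        for (Set<Long> adjacency : neighbourAdjacency.values()) {
            for (long target : adjacency) {
                // count an edge only if its other endpoint is also a neighbour of the centre
                if (centralNeighbours.contains(target)) {
                    links++;
                }
            }
        }
        // every undirected edge between two neighbours is seen twice (once from each
        // endpoint), so links = 2 * |edges among neighbours|
        return (float) links / (k * (k - 1));
    }
}

The directed variant differs only in how the edges among neighbours are counted and normalised.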

It can be clearly seen that Job2 is the actual problem of the whole algorithm. It generates a massive amount of data that has to be sorted/copied/merged, which basically kills the algorithm's performance for certain datasets. Can someone guide me on how to improve the algorithm? (I was thinking about making Job2 iterative ["process" only M nodes out of N in each iteration until every node is "processed"], but I have abandoned this idea for now.) In my opinion the Job2 map output should be decreased, to avoid the costly sort/merge phases that kill the performance.

I have also implemented the same algorithm (3 jobs as well, the same "communication" pattern, and the same "Job2" problem) on the Giraph platform. However, Giraph is an in-memory platform, and for the same "problematic" datasets the algorithm ends in an OutOfMemoryException.

I will be grateful for any comment, remark or guideline.


UPDATE

I'm going to change the algorithm "drastically". I've found this article: Counting Triangles.

Once the code is implemented I'm going to post my opinion here along with more detailed code (if this approach is successful).


UPDATE_2

In the end I ended up "modifying" the NodeIterator++ algorithm to my needs (the Yahoo paper is available through a link in the article). Unfortunately, although I can see an improvement in performance, the end result is not as good as I had hoped. The conclusion I have reached is that the cluster available to me is simply too small to make the LCC calculation feasible for these specific datasets. So the question remains, or rather evolves: does anyone know of an efficient distributed/sequential algorithm for calculating LCC or counting triangles with limited resources? (By no means am I saying that the NodeIterator++ algorithm is bad; I am simply saying that the resources available to me are not sufficient.)
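For reference, the core trick I kept from NodeIterator++ is the degree-based ordering: every triangle is enumerated exactly once, at its lowest-degree vertex, so hubs never have to pair up all of their neighbours. Written out sequentially it looks roughly like the sketch below (plain Java with illustrative names, not my MapReduce code and not the paper's exact formulation):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SequentialTriangles {

    // adjacency: node id -> set of neighbour ids (undirected graph assumed)
    static long countTriangles(Map<Integer, Set<Integer>> adjacency) {
        long triangles = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : adjacency.entrySet()) {
            int v = entry.getKey();
            // keep only the neighbours that come after v in the degree ordering
            List<Integer> higher = new ArrayList<>();
            for (int u : entry.getValue()) {
                if (precedes(v, u, adjacency)) {
                    higher.add(u);
                }
            }
            // check every wedge (u, v, w) pivoted at v for the closing edge (u, w)
            for (int i = 0; i < higher.size(); i++) {
                for (int j = i + 1; j < higher.size(); j++) {
                    if (adjacency.get(higher.get(i)).contains(higher.get(j))) {
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }

    // v precedes u if v has the smaller degree (ties broken by id)
    static boolean precedes(int v, int u, Map<Integer, Set<Integer>> adjacency) {
        int degV = adjacency.get(v).size();
        int degU = adjacency.get(u).size();
        return degV < degU || (degV == degU && v < u);
    }
}

To get LCC instead of a global triangle count, every found triangle increments a per-node counter for all three of its vertices, and lcc(v) is that counter divided by deg(v) * (deg(v) - 1) / 2.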

alien01
  • just shooting in the dark.. have you tried [mahout's clustering job](https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/graph/common/LocalClusteringCoefficientJob.html) – Hari Menon Jun 10 '12 at 12:30
  • No, I will look into that. Thx for a tip. – alien01 Jun 10 '12 at 12:50
  • could you fix it? What does the reduce() for Job2 emit? – Hari Menon Jun 11 '12 at 06:23
  • Unfortunately Mahout's LocalClusteringCoefficientJob is not what I am looking for (it only works for undirected graphs, among other issues). Job2's reduce() emits a NodeNeighbourhood, which is the central node's data plus the neighbours' "restricted" data. – alien01 Jun 11 '12 at 07:37
  • I was myself about to suggest the Vertica blog to you earlier but somehow it slipped my mind. Did you also try the Pig version mentioned there? What is your data size and cluster config? – Hari Menon Jun 14 '12 at 13:54
  • I can only use Hadoop and Giraph (project requirements). The cluster is 20 workers (+ master, + 3 speculative), 8-core, 24 GB RAM each. The dataset is not big, 91 MB (the filtered biggest connected component of WikiTalk, which can be found in the SNAP repository [http://snap.stanford.edu/data/wiki-Talk.html]). But a "couple" of vertices are highly connected, hence the issue. Still, I cannot believe I can't process it in a feasible time. Just to show the "scale" of the problem: I've run the same calculations (LCC) with Neo4J on a number of similarly sized datasets. The average time was 40-80 s; WikiTalk took around 10 h. – alien01 Jun 14 '12 at 20:47
  • WikiTalk is just one of the problematic datasets, the only one publicly available. (Sorry, I couldn't fit this in one comment.) – alien01 Jun 14 '12 at 20:52
  • So if I understand right, calculating LCC on 91 MB of data takes 10 h on a 24-node, 24 GB cluster? Can you add the pseudocode for the Yahoo algorithm? Also, what is the output size and the number of output files after each job? – Hari Menon Jun 15 '12 at 03:59
  • This is not Hadoop, but may be interesting. [graph-tool local clustering implementation](http://graph-tool.skewed.de/static/doc/clustering.html#graph_tool.clustering.local_clustering) runs in parallel (shared memory OpenMP, though), I'm not sure how but I have tried it and it's extremely fast even with very big graphs. – jdehesa Oct 11 '14 at 15:30
  • I suggest trying an algorithm that first maps edges to neighbourhoods, then reduces neighbourhoods to nodes. – Indicator Oct 18 '14 at 02:45

1 Answer


In the paper "MapReduce in MPI for large scale graph algorithms" the authors give a nice description of a MapReduce implementation of triangle counting. The paper is available here: http://www.sciencedirect.com/science/article/pii/S0167819111000172 but you may need an account to access it. (I'm on a university system that has paid for the subscription, so I never know what is only accessible because it has already been paid for.) The authors may also have a draft posted on their personal website(s).

There is another way you could count triangles, though it is probably much less efficient unless your graph is fairly dense. First, construct the adjacency matrix A of your graph. Then compute A^3 (you could do the matrix multiplication in parallel fairly easily). Then sum up the (i,i) entries of A^3 and divide the result by 6. That gives you the number of triangles, because the (i,j) entry of A^k counts the number of length-k walks from i to j, and since we are only looking at length-3 walks, any walk that starts at i and ends back at i after 3 steps is a triangle... overcounted by a factor of 6. This is mainly less efficient because the matrix will be very large compared to an edge list if your graph is sparse.
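A rough sequential sketch of that idea in Java (dense matrices and naive cubic multiplication, so purely for illustration; the class and method names are made up):

public class MatrixTriangles {

    static long[][] multiply(long[][] x, long[][] y) {
        int n = x.length;
        long[][] result = new long[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    result[i][j] += x[i][k] * y[k][j];
        return result;
    }

    // a is the (symmetric, 0/1) adjacency matrix of an undirected graph
    static long countTriangles(long[][] a) {
        long[][] a3 = multiply(multiply(a, a), a); // A^3
        long trace = 0;
        for (int i = 0; i < a.length; i++) {
            trace += a3[i][i]; // closed walks of length 3 starting and ending at i
        }
        return trace / 6; // each triangle is counted 6 times (3 starting nodes x 2 directions)
    }
}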

TravisJ