
I am working on Big Data, and my project consists of graphs containing text data. I have to calculate similarity between vertices, hop probabilities, the number of connected components, the PageRank vector, and random walks, all in one project.

I implemented these in Hadoop, but I think it is taking too long (a graph with 2,500 nodes, 4,000 edges, and 600 connected components takes 25 minutes). So what would be the best choice for implementing them: Apache Hadoop, Apache Giraph, or Apache Twister?

  • Your question does not make any sense. If you have already implemented similarity between vertices, hop probabilities, number of connected components, the PageRank vector, and random walks for your data using Hadoop (assuming MapReduce), and all of them together take 25 minutes, I would say that's not bad. You could possibly filter and optimize it further. – Sambit Tripathy May 29 '15 at 07:05

2 Answers

1

Finding connected components, PageRank calculation, and random walks are examples of iterative algorithms. The traditional MapReduce programming model is not a good option for iterative algorithms (especially graph algorithms), because in each MapReduce iteration all of the data must be transmitted from mappers to reducers (i.e. high I/O and network traffic). In contrast, Giraph is well suited to this kind of algorithm: all of the data is partitioned and loaded once, and in each iteration (superstep) only the results (messages) are transmitted between machines.
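To make the superstep model concrete, here is a minimal sketch of a PageRank computation in Giraph's vertex-centric API (class and package names follow Giraph 1.1; `MAX_SUPERSTEPS` and the initial vertex values are illustrative assumptions, not part of the question):

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

/** PageRank sketch: one compute() call per vertex per superstep. */
public class PageRankSketch extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;  // illustrative iteration count

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      // Combine the rank mass received from in-neighbours in the previous superstep.
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.setValue(new DoubleWritable(0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Only the per-edge rank contribution travels over the network.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();  // the job ends once every vertex has halted
    }
  }
}
```

The graph itself stays loaded on the workers between supersteps, which is exactly what a chain of MapReduce jobs cannot give you.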

0

Although it has been a while since this question was posted, I thought I would chime in with my experience. Regarding your concern about processing time: it depends on how much processing you are doing with your data. Are you doing all of the above calculations in a single MR job, or as multiple MR jobs in the same program? If you are chaining several jobs, that alone could explain the time. Also, how many iterations are you running for the PageRank calculation, and what is the size of your cluster?

I would go with Masoud's answer of selecting Giraph for graph processing, and would like to add a bit more. There are several reasons why graph processing is hard with the MapReduce programming model:

  1. You would need to partition the graph, as it would not fit on a single machine (for example, range partitioning to keep neighborhoods together: if you had nodes/users from 5 different universities, you would most likely want all nodes from a single university on the same machine; see the sketch after this list).

  2. You might need to perform replication of your data.

  3. You would need to reduce cross-partition communication.
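As a rough illustration of point 1, here is a toy sketch (plain Java, not Giraph API; all names are hypothetical) contrasting hash partitioning with the range partitioning mentioned above:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of the partitioning choices described above (hypothetical helper). */
public class PartitionSketch {

    /** Hash partitioning: spreads vertices evenly but ignores neighborhood locality. */
    static int hashPartition(long vertexId, int numWorkers) {
        return (int) (Math.abs(vertexId) % numWorkers);
    }

    /** Range partitioning: contiguous id ranges stay together, so if ids are assigned
     *  per community (e.g. per university), neighbors usually land on the same worker. */
    static int rangePartition(long vertexId, long maxVertexId, int numWorkers) {
        long rangeSize = (maxVertexId / numWorkers) + 1;
        return (int) (vertexId / rangeSize);
    }

    public static void main(String[] args) {
        int numWorkers = 5;
        long maxVertexId = 2500;  // roughly the graph size from the question
        Map<Integer, Integer> load = new HashMap<>();
        for (long v = 0; v < maxVertexId; v++) {
            load.merge(rangePartition(v, maxVertexId, numWorkers), 1, Integer::sum);
        }
        System.out.println("vertices per worker (range partitioning): " + load);
    }
}
```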

Coming back to your second concern: I do not have any experience with Apache Twister, so I would go for Apache Giraph, as it is built specifically for large-scale distributed graph algorithms and the framework handles all of the heavy processing for you. This fits the nature of graph algorithms, which mostly consist of traversing a graph and passing information along edges to other vertices.
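For example, counting connected components, one of the computations you listed, becomes a short label-propagation job in Giraph. The following is a minimal sketch under the same assumptions as the PageRank example above (the distinct labels remaining at the end give the number of components):

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/** Connected-components sketch: every vertex adopts the smallest id it has seen. */
public class ConnectedComponentsSketch extends BasicComputation<
    LongWritable, LongWritable, NullWritable, LongWritable> {

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
                      Iterable<LongWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      // Start with the vertex's own id as its component label and announce it.
      vertex.setValue(new LongWritable(vertex.getId().get()));
      sendMessageToAllEdges(vertex, vertex.getValue());
      vertex.voteToHalt();
      return;
    }
    long minLabel = vertex.getValue().get();
    for (LongWritable message : messages) {
      minLabel = Math.min(minLabel, message.get());
    }
    if (minLabel < vertex.getValue().get()) {
      // Label improved: store it and wake the neighbours with the new value.
      vertex.setValue(new LongWritable(minLabel));
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    vertex.voteToHalt();  // a vertex stays halted until a smaller label arrives
  }
}
```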

I recently used Giraph for one of my Big Data projects, and it was a great learning experience. You should look into it, if I am not replying too late.

You could refer to these slides for a detailed explanation.

user3626602