I am currently working on a college project in which we are building a content extractor for the World Wide Web. So far we have two modules, a Web crawler and an Indexer, which will run on two separate machines. We plan to add more modules as the work progresses, but right now we need some means of communication between the two, i.e. some form of message passing.

We are unsure about the following:

(i) We feel that our application does not need synchronous message passing. Basically, the crawler module crawls web pages and calls the Indexer module when it visits a particular page. Should we go ahead and choose an asynchronous protocol (like JMS), or is there some advantage to using a synchronous protocol instead?

(ii) We are currently thinking of using JMS, perhaps with Google Protocol Buffers, to pass the necessary data (the URLs) between the two machines. Would this be appropriate, or are there better options?

Our main criteria for a suitable protocol are scalability, followed by speed.

This is the first time any of us has worked on a distributed application of any kind, so any help would be most appreciated :)

Thank you :)

arya

1 Answer

I worked on a similar real-world system a few years ago, where the Web crawler looked for malware sites to add to a blacklist (it was a security company).

Our crawlers worked independently from the workers. This allowed better scalability and performance.

The crawlers put data into a DB. A job would then kick off at regular intervals, fetch the unprocessed records (I think we had a status column) and pass them to worker threads for processing in parallel.

If I were to do this today, I would use a NoSQL DB like MongoDB and some map-reduce algorithm.
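By map-reduce I mean something along these lines. This is plain Python to show the idea, not MongoDB's map-reduce API, and the function names and the inverted-index example are only assumptions about what your Indexer produces.

```python
from collections import defaultdict


def map_page(url, text):
    """Map step: emit one (word, url) pair per distinct word on the page."""
    return [(word, url) for word in set(text.lower().split())]


def reduce_pairs(pairs):
    """Reduce step: group the emitted pairs by word into an inverted index."""
    index = defaultdict(set)
    for word, url in pairs:
        index[word].add(url)
    return dict(index)


def build_index(pages):
    """Run the map step over every page, then reduce all emitted pairs."""
    pairs = []
    for url, text in pages.items():
        pairs.extend(map_page(url, text))
    return reduce_pairs(pairs)
```

Each `map_page` call depends only on its own page, so that step can be spread over as many workers or machines as you have. Only the reduce step has to bring the results together.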

Hope that's useful.

Rakesh

TacticalCoder
FinalFive
  • Thank you :) The Indexer module is currently using MongoDB. What are the advantages of map-reduce over simple message passing? Currently we are not dealing with clusters of machines, just one computer per module. I thought map-reduce would only be needed later on, if we distribute the work of a single module over multiple machines. Am I right? – arya Mar 24 '12 at 07:03