I am currently working on a college project in which we are building a content extractor for the World Wide Web. So far we have two modules: a web crawler and an indexer, which will run on two separate machines. We plan to add more modules as the work progresses, but right now we need some means of communication between the two, some form of message passing.
We are unsure about the following:
(i) We feel that our application does not need synchronous message passing. Basically, the crawler module crawls web pages and notifies the indexer module each time it visits a page; it does not need to wait for a reply. So should we go ahead and choose an asynchronous messaging API (like JMS), or is there some advantage to using a synchronous protocol instead?
(ii) We are currently thinking of using JMS, perhaps with Google Protocol Buffers to serialize the necessary data (the URLs) passed between the two machines. Would this be appropriate, or are there better options?
Our main criterion for a suitable protocol is scalability, followed by speed.
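To make (i) concrete, here is a rough sketch of the fire-and-forget hand-off we have in mind. It is not JMS: a `java.util.concurrent.BlockingQueue` inside a single JVM stands in for the queue that would sit between the two machines, and the class name, the `run` method, and the end-of-stream marker are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: an in-process queue standing in for a real message broker.
class CrawlerIndexerSketch {
    // Hypothetical marker telling the indexer that no more URLs will arrive.
    static final String END_OF_STREAM = "__END__";

    // Starts one crawler thread and one indexer thread and
    // returns the URLs in the order the indexer received them.
    static List<String> run(List<String> crawledUrls) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> indexed = new ArrayList<>();

        // Crawler side: enqueue each visited URL and move on without waiting.
        Thread crawler = new Thread(() -> {
            for (String url : crawledUrls) {
                queue.add(url);
            }
            queue.add(END_OF_STREAM);
        });

        // Indexer side: block until a URL is available, then process it.
        Thread indexer = new Thread(() -> {
            try {
                while (true) {
                    String url = queue.take();
                    if (url.equals(END_OF_STREAM)) {
                        break;
                    }
                    indexed.add(url); // placeholder for the real indexing work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        crawler.start();
        indexer.start();
        crawler.join();
        indexer.join(); // join() makes the indexer's writes visible to this thread
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(Arrays.asList("http://example.com/a", "http://example.com/b")));
    }
}
```

The point of the sketch is that the crawler never waits on the indexer; the queue absorbs any difference in speed between the two. Whether a real broker gives us the same behaviour across machines is part of what we are asking.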
This is the first time any of us has worked on a distributed application of any kind, so any help would be most appreciated :)
Thank you :)