Efficient design of crawler4J to get data

Question

I am trying to get data from various websites. After searching on Stack Overflow, I am using crawler4j, as many people suggested it. Below is my understanding/design:

 1. Get sitemap.xml from robots.txt.
 2. If sitemap.xml is not available in robots.txt, look for sitemap.xml directly.
 3. Get the list of all URLs from sitemap.xml.
 4. Fetch the content for all of the above URLs.
 5. If sitemap.xml is also not available, scan the entire website.
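
As a rough illustration of steps 1 and 2 above, here is a minimal Java sketch that looks for `Sitemap:` lines in the text of a robots.txt file and falls back to the conventional `/sitemap.xml` location when none is declared. The class `SitemapLocator` and its `locate` method are hypothetical names made up for this example; they are not part of crawler4j, and downloading robots.txt itself is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of crawler4j: finds candidate sitemap URLs for a site. */
final class SitemapLocator {

    /**
     * Returns the URLs declared on "Sitemap:" lines of the given robots.txt content (step 1).
     * If none is declared, returns the assumed default location siteRoot + "/sitemap.xml" (step 2).
     */
    static List<String> locate(String siteRoot, String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        if (robotsTxt != null) {
            for (String line : robotsTxt.split("\\R")) {
                String trimmed = line.trim();
                // Match the directive name case-insensitively ("Sitemap:", "sitemap:", ...).
                if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                    String url = trimmed.substring(8).trim();
                    if (!url.isEmpty()) {
                        sitemaps.add(url);
                    }
                }
            }
        }
        if (sitemaps.isEmpty()) {
            // Fallback guess only: the site may not actually serve a file at this path.
            String root = siteRoot.endsWith("/")
                    ? siteRoot.substring(0, siteRoot.length() - 1)
                    : siteRoot;
            sitemaps.add(root + "/sitemap.xml");
        }
        return sitemaps;
    }
}
```

A call such as `SitemapLocator.locate("http://example.com", robotsTxtContent)` would return the declared sitemap URLs, or the single fallback URL. If fetching that fallback also fails, step 5 (a full crawl) applies.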

Can you please let me know whether crawler4j is able to do steps 1, 2 and 3? Please also suggest a better design if one is available (assuming no feeds are available), and if so, guide me on how to implement it.

Thanks, Venkat

Any help will be greatly appreciated ... – topblog Feb 26 '12 at 02:37

score 3 · Answer 1 · answered Feb 14 '13 at 08:44

Crawler4j cannot perform steps 1, 2 and 3; however, it performs quite well for steps 4 and 5. My advice would be to use a Java HTTP client, such as the one from Apache HttpComponents, to get the sitemap. Parse the XML using any Java XML parser and add the URLs to a collection. Then populate your crawler4j seeds with the list:

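// sitemapsUrl holds the URLs collected from the parsed sitemap
for (String url : sitemapsUrl) {
    controller.addSeed(url);
}
// start() takes the crawler class and the number of crawler threads
controller.start(YourCrawler.class, nbthreads);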
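
The "parse the XML and add the URLs to a collection" step could look roughly like the sketch below, which uses the DOM parser that ships with the JDK. `SitemapUrlExtractor` and `extractUrls` are hypothetical names for this example, and the sitemap content is assumed to have been downloaded into a string already; any other Java XML parser would work just as well.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper, not part of crawler4j: collects the <loc> entries of a sitemap. */
final class SitemapUrlExtractor {

    /** Returns the trimmed text of every <loc> element in the given sitemap XML. */
    static List<String> extractUrls(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Sitemaps come from remote sites, so reject DOCTYPE declarations (guards against XXE).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

            List<String> urls = new ArrayList<>();
            NodeList locs = doc.getElementsByTagName("loc");
            for (int i = 0; i < locs.getLength(); i++) {
                String url = locs.item(i).getTextContent().trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse sitemap XML", e);
        }
    }
}
```

The returned list can then play the role of `sitemapsUrl` in the seeding loop above. Note that a sitemap index file also uses `<loc>` elements, but there they point to further sitemaps rather than pages, so those entries would need to be fetched and parsed in turn.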

score 1 · Answer 2 · edited Jul 26 '12 at 15:03


I have never used crawler4j, so take my opinion with a grain of salt: I think that it can be done by the crawler, but it looks like you would have to modify some code. Specifically, you can take a look at RobotstxtParser.java and HostDirectives.java. You would have to modify the parser to extract the sitemap and create a new field in the directives to return the sitemap.xml. Step 3 can be done in the fetcher if no directives were returned from robots.txt.

However, I'm not sure exactly what you gain by checking the sitemap.xml: it seems to be a useless thing to do unless you're looking for something specific.

answered Feb 26 '12 at 16:43 by Kiril · edited Jul 26 '12 at 15:03 by javanna

Thx Lirik. I heard that some websites provide the list of all product URLs in the sitemap.xml (mentioned in robots.txt). Instead of crawling the entire website, I thought it would be a good option to go through sitemap.xml. I also guess that crawling the entire site may give some unnecessary links (FAQ etc.). What do you say? – topblog Feb 27 '12 at 18:05
Actually, my requirement is to get the list of all URLs of different categories like books, mobiles, laptops, etc., similar to PriceGrabber. – topblog Feb 27 '12 at 18:07
