I am trying to collect data from various websites. After searching on Stack Overflow, I am using crawler4j, as many people suggested it. Below is my understanding/design:
1. Get sitemap.xml from robots.txt.
2. If sitemap.xml is not available in robots.txt, look for sitemap.xml directly.
3. Get the list of all URLs from sitemap.xml.
4. Fetch the content for all of the above URLs.
5. If sitemap.xml is also not available, then scan the entire website.
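To make steps 1 and 3 concrete, here is a rough sketch of what I have in mind, using only the JDK and independent of crawler4j. It assumes the robots.txt and sitemap.xml bodies have already been downloaded as strings. The class name SitemapSketch and its method names are my own placeholders, not part of any library, and example.com is only a sample host.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of crawler4j.
public class SitemapSketch {

    // Step 1: collect the URLs named in "Sitemap:" lines of a robots.txt body.
    public static List<String> sitemapsFromRobots(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        String prefix = "sitemap:";
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The field name is matched case-insensitively.
            if (trimmed.regionMatches(true, 0, prefix, 0, prefix.length())) {
                sitemaps.add(trimmed.substring(prefix.length()).trim());
            }
        }
        return sitemaps;
    }

    // Step 3: collect the text of every <loc> element in a sitemap document.
    public static List<String> urlsFromSitemap(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(sitemapXml)));
            NodeList locs = doc.getElementsByTagName("loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
            return urls;
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalArgumentException("Could not parse sitemap", e);
        }
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\n"
                + "Disallow: /private/\n"
                + "Sitemap: https://example.com/sitemap.xml\n";
        String sitemap = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.com/page1</loc></url>"
                + "<url><loc>https://example.com/page2</loc></url>"
                + "</urlset>";
        System.out.println(sitemapsFromRobots(robots));
        System.out.println(urlsFromSitemap(sitemap));
    }
}
```

A sitemap index file also lists its child sitemaps in <loc> elements, so the same method could be applied to it first and then to each child. Step 2 would just be a direct request for /sitemap.xml when the first method returns an empty list.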
Can you please let me know whether crawler4j is able to do steps 1, 2 and 3? Please also suggest whether a better design is available (assuming no feeds are available), and if so, how I should implement it.
Thanks, Venkat