I've implemented a basic crawler that retrieves data from seed URLs and is able to download the pages. Furthermore, I am able to keep my crawler on the same seed website until the specified depth is reached. How can I impose more restrictions on my crawler, such as downloading a page only if it contains a minimum threshold of predefined keywords? Is there a way to do this in the shouldVisit() function?
1 Answer
Unfortunately, you are up against a constraint that is standard for crawlers: you have to download a page in order to determine whether it contains the keywords you are looking for. Like most crawlers, crawler4j can only operate on the data it has already downloaded; for a page it has not yet crawled, it knows only the URL string, which may, but most often does not, contain your keywords. The
public boolean shouldVisit(WebURL url)
is indeed the only official place (i.e., without modifying the original library) where you can make that decision, and you have to base it on the URL.
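For illustration, here is a minimal sketch of a URL-only keyword check, assuming the crawler4j 3.x WebCrawler API with the shouldVisit(WebURL url) signature shown above; the class name, keyword list, and threshold are hypothetical placeholders:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;

    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class KeywordUrlCrawler extends WebCrawler {

        // Hypothetical keywords and threshold; replace with your own.
        private static final List<String> KEYWORDS = Arrays.asList("crawler", "java", "search");
        private static final int URL_THRESHOLD = 1;

        @Override
        public boolean shouldVisit(WebURL url) {
            // Only the URL string is available here, not the page content.
            String href = url.getURL().toLowerCase(Locale.ROOT);
            long hits = KEYWORDS.stream().filter(href::contains).count();
            return hits >= URL_THRESHOLD;
        }
    }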
However, if for some reason you must know about the keywords before downloading the page, you could consider using a third-party web search API such as Bing, which indexes public web pages, and check whether its search results for that page contain the keywords you are looking for - but this only works for public sites that services like Bing can reach. You would also need to weigh the pros and cons of querying Bing versus simply downloading the page yourself; in most cases, downloading it yourself probably makes more sense.
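If you did go the search-service route, the check might look roughly like the sketch below. The endpoint, query format, and response handling are placeholders only (a real Bing or other search API has its own URL, authentication, and JSON response that you would need to parse properly):

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class SearchIndexChecker {

        // Placeholder endpoint; a real search API needs its own URL and an API key.
        private static final String SEARCH_ENDPOINT = "https://example.com/search?q=";

        private final HttpClient client = HttpClient.newHttpClient();

        /** Crude check: does the search service's result text for this page mention the keyword? */
        public boolean resultMentions(String pageUrl, String keyword) throws Exception {
            String query = URLEncoder.encode("site:" + pageUrl + " " + keyword, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(SEARCH_ENDPOINT + query)).build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // A real implementation would parse the structured response instead of substring-matching.
            return response.body().toLowerCase().contains(keyword.toLowerCase());
        }
    }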
One last thought, in case I misread you: if you mean that you do not want to download any further links/pages based on the page you just downloaded (e.g., do not visit any links found on page X because page X did not contain the right keywords, so the links on that page are assumed to be bad), then you would have to look up the parent URL in some central datastore, such as a database, and check whether you should visit the new URL in the:
public boolean shouldVisit(WebURL url)
provided you added that information to the central datastore in the:
public void visit(Page page)
method. Regardless, shouldVisit is the final method that determines whether the crawler should go fetch content. By default, all you have to go on there is the URL information, plus whatever else you bring in yourself, such as your own populated datastore or a third-party API. One last warning: crawler4j is multithreaded, so if you do use a centralized datastore or a third-party API, take that into account when accessing it from the shouldVisit method.
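As a concrete illustration of that datastore approach (and of the threading concern), here is a minimal sketch that uses an in-memory, thread-safe set in place of a database; the class name, keywords, and threshold are hypothetical, and the crawler4j 3.x signatures from above are assumed:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class ParentAwareCrawler extends WebCrawler {

        // Shared across crawler threads, hence the concurrent set; stands in for a real database.
        private static final Set<String> GOOD_PARENTS = ConcurrentHashMap.newKeySet();

        private static final List<String> KEYWORDS = Arrays.asList("crawler", "java", "search");
        private static final int THRESHOLD = 2;

        @Override
        public boolean shouldVisit(WebURL url) {
            String parent = url.getParentUrl();
            // Seed URLs typically have no parent; otherwise, only follow links whose parent page qualified.
            return parent == null || GOOD_PARENTS.contains(parent);
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                String text = ((HtmlParseData) page.getParseData()).getText().toLowerCase(Locale.ROOT);
                long hits = KEYWORDS.stream().filter(text::contains).count();
                if (hits >= THRESHOLD) {
                    // Record this page as a "good" parent so its outgoing links pass shouldVisit.
                    GOOD_PARENTS.add(page.getWebURL().getURL());
                }
            }
        }
    }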

Thanks Jordan! I'm left with no option but to download the webpage. I thought there might be a gap between the point where the crawler has an input URL and the point where it downloads it, so that the thread could scan the webpage and only download the page if it meets my requirements (the aim is to save space on my local machine). – Aditya Kohli Nov 08 '14 at 15:45
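Following up on that comment: since the page has to be fetched anyway, the space saving happens at persist time, i.e., in visit(Page page) you decide whether the already-downloaded bytes are worth writing to disk. A rough sketch, with a hypothetical output directory and keyword check:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;

    public class SpaceSavingCrawler extends WebCrawler {

        private static final Path OUT_DIR = Paths.get("downloads"); // hypothetical output directory

        @Override
        public void visit(Page page) {
            if (!(page.getParseData() instanceof HtmlParseData)) {
                return;
            }
            String text = ((HtmlParseData) page.getParseData()).getText().toLowerCase();
            // Keep the bytes only when the already-fetched page meets your keyword criterion.
            if (text.contains("crawler")) { // replace with your own threshold check
                try {
                    Files.createDirectories(OUT_DIR);
                    String name = Integer.toHexString(page.getWebURL().getURL().hashCode()) + ".html";
                    Files.write(OUT_DIR.resolve(name), page.getContentData());
                } catch (IOException e) {
                    System.err.println("Could not save " + page.getWebURL().getURL() + ": " + e.getMessage());
                }
            }
        }
    }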