
In crawler4j we can override the function `boolean shouldVisit(WebURL url)` and control whether a particular URL is allowed to be crawled by returning `true` or `false`.

But can we add URL(s) at runtime? If yes, what are the ways to do that? Currently I can add URL(s) at the beginning of the program using the `addSeed(String url)` function, before calling `start(BasicCrawler.class, numberOfCrawlers)` on the `CrawlController` class, but if I try to add a new URL with `addSeed(String url)` after the crawl has started, it gives an error. Here is the error image.
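For reference, here is a stripped-down sketch of roughly how I set things up (crawler4j 3.x style; names may differ slightly from my actual project):

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root");

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
int numberOfCrawlers = 5; // for example

// Seeds added here, before the crawl starts, work fine
controller.addSeed("http://www.facebook.com/");

// Calling addSeed() after this line is what gives me the error
controller.start(BasicCrawler.class, numberOfCrawlers);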

Any help will be appreciated, and please let me know if any more detail about the project is required to answer the question.

2 Answers


You can do this.

Use `public void schedule(WebURL url)`, a method of the Frontier.java class, to add URLs to the crawler frontier at runtime. For this you need your URL as a `WebURL`; to see how to turn a string into a `WebURL`, have a look at `addSeed()` in CrawlController.java (code below), which does exactly that conversion.

Also, make sure you use the existing frontier instance rather than creating a new one.

Hope this helps.

public void addSeed(String pageUrl, int docId) {
        String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
        if (canonicalUrl == null) {
            logger.error("Invalid seed URL: " + pageUrl);
            return;
        }
        if (docId < 0) {
            docId = docIdServer.getDocId(canonicalUrl);
            if (docId > 0) {
                // This URL is already seen.
                return;
            }
            docId = docIdServer.getNewDocID(canonicalUrl);
        } else {
            try {
                docIdServer.addUrlAndDocId(canonicalUrl, docId);
            } catch (Exception e) {
                logger.error("Could not add seed: " + e.getMessage());
            }
        }

        WebURL webUrl = new WebURL();
        webUrl.setURL(canonicalUrl);
        webUrl.setDocid(docId);
        webUrl.setDepth((short) 0);
        if (!robotstxtServer.allows(webUrl)) {
            logger.info("Robots.txt does not allow this seed: " + pageUrl);
        } else {
            frontier.schedule(webUrl); //method that adds URL to the frontier at run time
        }
    } 
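As a rough, untested sketch (illustrative only: how you obtain references to the running controller's frontier and docIdServer depends on your setup, and the class and method names below are my own, not part of crawler4j):

import edu.uci.ics.crawler4j.frontier.DocIDServer;
import edu.uci.ics.crawler4j.frontier.Frontier;
import edu.uci.ics.crawler4j.url.URLCanonicalizer;
import edu.uci.ics.crawler4j.url.WebURL;

// Illustrative helper: schedules an extra URL on a crawl that is already running.
// It must be given the SAME frontier and docIdServer instances the controller uses.
public class RuntimeSeeder {

    private final Frontier frontier;
    private final DocIDServer docIdServer;

    public RuntimeSeeder(Frontier frontier, DocIDServer docIdServer) {
        this.frontier = frontier;
        this.docIdServer = docIdServer;
    }

    public void addUrlAtRuntime(String pageUrl) {
        String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
        if (canonicalUrl == null) {
            return; // invalid URL, nothing to schedule
        }
        if (docIdServer.getDocId(canonicalUrl) > 0) {
            return; // this URL has already been seen
        }
        WebURL webUrl = new WebURL();
        webUrl.setURL(canonicalUrl);
        webUrl.setDocid(docIdServer.getNewDocID(canonicalUrl));
        webUrl.setDepth((short) 0);
        frontier.schedule(webUrl); // hands the URL to the running crawler threads
    }
}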
– samsamara

Presumably you can implement this function however you like, and have it depend on a list of URLs that should not be crawled. The implementation of shouldVisit is then going to involve asking if a given URL is in your list of forbidden URLs (or permitted URLs), and returning true or false on that basis.
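For instance, something along these lines (an illustrative sketch; the allow-list set and the allowHost helper are my own names, not part of crawler4j):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Shared, thread-safe allow-list; other code can add hosts while the crawl runs
    private static final Set<String> allowedHosts =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    public static void allowHost(String host) {
        allowedHosts.add(host);
    }

    @Override
    public boolean shouldVisit(WebURL url) {
        // Visit only URLs whose domain is currently in the allow-list
        return allowedHosts.contains(url.getDomain());
    }
}

Elsewhere, at any point while the crawl is running, you could call `MyCrawler.allowHost("google.com")` to start permitting links from that domain.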

– Gian
  • yeah, I understood your answer, but my question was: if I have given a seed at the beginning as 'www.facebook.com', then all the links in the 'facebook' domain will pass through `shouldVisit` and, depending on the implementation of that function, they will (or will not) be allowed; but can I add a new seed, say `www.google.com`, in between, while it is still crawling `facebook` from its list of URL(s) to be crawled? Am I clear to you? –  Jul 14 '12 at 09:54
  • Yes, and my answer is the same. You have to change the implementation of your function to depend upon some data structure which you can update. – Gian Jul 14 '12 at 09:55
  • Have you looked at the `controller.addSeed("http://www.ics.uci.edu/");` example on the front page of the crawler4j site? It looks like you just need to call this again - it's basically a new crawl, but I don't see that this should make much of a difference? – Gian Jul 14 '12 at 09:58
  • yeah, I have this function, but calling it at runtime gives an error. I mean, if I call it after calling `controller.start(MyCrawler.class, numberOfCrawlers)`, it starts giving an error. –  Jul 14 '12 at 10:05
  • OK, then perhaps you should edit your question to reflect that problem you are having, because there is not enough detail for anybody to help at the moment. – Gian Jul 14 '12 at 10:10