
I am building a crawler application in Grails (Groovy). I am using Crawler4j and following this tutorial.

  1. I created a new Grails project.
  2. Put the BasicCrawlController.groovy file under grails-app/controllers, inside my package.
  3. Did not create any view, because I expected that on doing run-app my crawled data would appear in my crawlStorageFolder (please correct me if my understanding is flawed).

After that I just ran the application with run-app, but I didn't see any crawling data anywhere.

  1. Am I right in expecting some file to be created at the crawlStorageFolder location that I have given as C:/crawl/crawler4jStorage?
  2. Do I need to create any view for this?
  3. If I want to invoke this crawler controller from some other view, on click of a submit button of a form, can I just write `<g:form name="submitWebsite" url="[controller:'BasicCrawlController']">`?

I ask this because I do not have any method in this controller, so is that the right way to invoke it?

My code is as follows:

//All necessary imports  



    public class BasicCrawlController {
        static main(args) throws Exception {
            String crawlStorageFolder = "C:/crawl/crawler4jStorage";
            int numberOfCrawlers = 1;
            //int maxDepthOfCrawling = -1;    default
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);
            config.setPolitenessDelay(1000);
            config.setMaxPagesToFetch(100);
            config.setResumableCrawling(false);
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://en.wikipedia.org/wiki/Web_crawler")
            controller.start(BasicCrawler.class, 1);

        }
    }


    class BasicCrawler extends WebCrawler {

        final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4" +
            "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))\$")

        /**
         * You should implement this function to specify whether the given URL
         * should be crawled or not (based on your crawling logic).
         */
        @Override
        boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase()
            !FILTERS.matcher(href).matches() && href.startsWith("http://en.wikipedia.org/wiki/Web_crawler/")
        }

        /**
         * This function is called when a page is fetched and ready to be processed
         * by your program.
         */
        @Override
        void visit(Page page) {
            int docid = page.getWebURL().getDocid()
            String url = page.getWebURL().getURL()
            String domain = page.getWebURL().getDomain()
            String path = page.getWebURL().getPath()
            String subDomain = page.getWebURL().getSubDomain()
            String parentUrl = page.getWebURL().getParentUrl()
            String anchor = page.getWebURL().getAnchor()

            println("Docid: ${docid}")
            println("URL: ${url}")
            println("Domain: '${domain}'")
            println("Sub-domain: '${subDomain}'")
            println("Path: '${path}'")
            println("Parent page: ${parentUrl}")
            println("Anchor text: ${anchor}")

            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
                String text = htmlParseData.getText()
                String html = htmlParseData.getHtml()
                List<WebURL> links = htmlParseData.getOutgoingUrls()

                println("Text length: " + text.length())
                println("Html length: " + html.length())
                println("Number of outgoing links: " + links.size())
            }
            Header[] responseHeaders = page.getFetchResponseHeaders()
            if (responseHeaders != null) {
                println("Response headers:")
                for (Header header : responseHeaders) {
                    println("\t${header.getName()} : ${header.getValue()}")
                }
            }
            println("=============")
        }
    }
clever_bassi

  • Downvoters, please give your reason for downvoting if you cannot help. – clever_bassi Jun 26 '14 at 20:56
  • IMHO this code does not belong in any controller. You should rather move it into a Grails service. But in general I don't understand your listing. It is not a Grails controller. And why do you use a `main` method within the controller? – saw303 Jun 27 '14 at 03:30
  • Maybe it is more helpful if you describe what you are trying to achieve. What should your web application do? – saw303 Jun 27 '14 at 03:33
  • Thanks for helping. Basically I am creating a web application. I have a URL, and when the user clicks the crawl button, I want the crawling to begin. I am really unsure where to put this code. Please suggest if you understand my problem statement. Also, I know it is Java code; I just converted it to Groovy. And I tried removing main, but then I got the error: "too many definitions for config". I might have made mistakes since I am a beginner. Thanks a lot – clever_bassi Jun 27 '14 at 04:22
  • Please edit your question and provide the Groovy code – saw303 Jun 27 '14 at 04:36
  • So a user should be able to submit a URL to your Grails web app, which then triggers the crawling process for that URL. But what happens then? What is the response for the user? Can you please clarify your use case? What you maybe want is a two-step process: first the URL submission, second the crawling, and maybe third displaying or storing the crawled data. Since crawling can consume some time, you might want to crawl asynchronously. Please provide further information about your use case. – saw303 Jun 27 '14 at 07:11
  • This is the code that I am using in my Groovy class. Is it wrong? My use case is that first the user submits a URL to be crawled, then the crawling begins, and after the crawling is done, I need to present the resource types contained in a web page to the user. I thought I would first crawl the pages, store the data, then perform parsing using Jsoup. Is that the wrong approach? Please suggest. Thanks – clever_bassi Jun 27 '14 at 12:31
  • Your code is a bit of a copy-and-paste mess (no offense). Are you new to Grails? I have tried to clean up the controller code. Please have a look at my answer. – saw303 Jun 27 '14 at 13:11
  • I am new to both Groovy and Grails. Still trying to understand. Thank you so much for all your help. :) – clever_bassi Jun 27 '14 at 13:19

1 Answer

I'll try to translate your code into a Grails standard.

Put this under grails-app/controllers:

    class BasicCrawlController {

        def index() {
            String crawlStorageFolder = "C:/crawl/crawler4jStorage"
            int numberOfCrawlers = 1
            //int maxDepthOfCrawling = -1 (default)

            CrawlConfig crawlConfig = new CrawlConfig()
            crawlConfig.setCrawlStorageFolder(crawlStorageFolder)
            crawlConfig.setPolitenessDelay(1000)
            crawlConfig.setMaxPagesToFetch(100)
            crawlConfig.setResumableCrawling(false)

            PageFetcher pageFetcher = new PageFetcher(crawlConfig)
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig()
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher)
            CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer)
            controller.addSeed("http://en.wikipedia.org/wiki/Web_crawler")
            controller.start(BasicCrawler.class, numberOfCrawlers)

            render "done crawling"
        }
    }
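As mentioned in the comments, long-running crawl logic arguably belongs in a Grails service rather than in the controller action itself. A minimal sketch of that refactoring (the `CrawlService` name and its `crawl` method are my own suggestions, not part of the original answer; the crawler4j import paths are the usual `edu.uci.ics.crawler4j` packages):

    // grails-app/services/CrawlService.groovy (hypothetical)
    import edu.uci.ics.crawler4j.crawler.CrawlConfig
    import edu.uci.ics.crawler4j.crawler.CrawlController
    import edu.uci.ics.crawler4j.fetcher.PageFetcher
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer

    class CrawlService {

        // Runs one blocking crawl of the given seed URL.
        def crawl(String seedUrl) {
            CrawlConfig config = new CrawlConfig()
            config.setCrawlStorageFolder("C:/crawl/crawler4jStorage")
            config.setPolitenessDelay(1000)
            config.setMaxPagesToFetch(100)

            PageFetcher pageFetcher = new PageFetcher(config)
            RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher)
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer)
            controller.addSeed(seedUrl)
            controller.start(BasicCrawler.class, 1)
        }
    }

The controller would then just declare `def crawlService` (Grails injects it by name) and call `crawlService.crawl(params.url)` from the action. Since `controller.start` blocks until the crawl finishes, a real application would likely run this asynchronously, as the comments suggest.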

Use this under src/groovy

    class BasicCrawler extends WebCrawler {

        final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4" +
            "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))\$")

        /**
         * You should implement this function to specify whether the given URL
         * should be crawled or not (based on your crawling logic).
         */
        @Override
        boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase()
            !FILTERS.matcher(href).matches() && href.startsWith("http://en.wikipedia.org/wiki/Web_crawler/")
        }

        /**
         * This function is called when a page is fetched and ready to be processed
         * by your program.
         */
        @Override
        void visit(Page page) {
            int docid = page.getWebURL().getDocid()
            String url = page.getWebURL().getURL()
            String domain = page.getWebURL().getDomain()
            String path = page.getWebURL().getPath()
            String subDomain = page.getWebURL().getSubDomain()
            String parentUrl = page.getWebURL().getParentUrl()
            String anchor = page.getWebURL().getAnchor()

            println("Docid: ${docid}")
            println("URL: ${url}")
            println("Domain: '${domain}'")
            println("Sub-domain: '${subDomain}'")
            println("Path: '${path}'")
            println("Parent page: ${parentUrl}")
            println("Anchor text: ${anchor}")

            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData()
                String text = htmlParseData.getText()
                String html = htmlParseData.getHtml()
                List<WebURL> links = htmlParseData.getOutgoingUrls()

                println("Text length: " + text.length())
                println("Html length: " + html.length())
                println("Number of outgoing links: " + links.size())
            }
            Header[] responseHeaders = page.getFetchResponseHeaders()
            if (responseHeaders != null) {
                println("Response headers:")
                for (Header header : responseHeaders) {
                    println("\t${header.getName()} : ${header.getValue()}")
                }
            }
            println("=============")
        }
    }
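Regarding question 3 (invoking the controller from a form): rather than a `url="[controller:'…']"` map, the conventional `g:form` attributes target the controller and action directly. Grails maps `BasicCrawlController` to the logical name `basicCrawl`. A sketch (the view location and button label are my own, not part of the original answer):

    <%-- e.g. in some GSP view under grails-app/views (hypothetical) --%>
    <g:form name="submitWebsite" controller="basicCrawl" action="index">
        <g:submitButton name="crawl" value="Start crawling"/>
    </g:form>

Submitting this form hits the `index` action above, which runs the crawl and renders "done crawling".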
saw303

  • Thank you so much. It worked for me :) It's such a great help for a beginner. :) – clever_bassi Jun 27 '14 at 13:31
  • I have a doubt. In my BasicCrawler.groovy, inside the shouldVisit function, I want to use a value that will be passed from my controller (the value that the user has entered). How can I pass that value from the controller to this Groovy class? – clever_bassi Jun 27 '14 at 16:58
  • Please raise a new question. – saw303 Jun 28 '14 at 05:29