I'm working with crawler4j using Groovy and Grails.

I have a BasicCrawler.groovy class in src/groovy, a domain class Crawler.groovy, and a controller called CrawlerController.groovy.

I have a few properties in the BasicCrawler.groovy class, such as url, parentUrl, domain, etc.

I want to persist these values to the database by passing them to the domain class while crawling is in progress.

I tried doing this in my BasicCrawler class under src/groovy:

class BasicCrawler extends WebCrawler {
   Crawler obj = new Crawler()
   //crawling code 
   @Override
   void visit(Page page) {
      //crawling code
      obj.url = page.getWebURL().getURL()
      obj.parentUrl = page.getWebURL().getParentUrl()
   }

   @Override
   protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
      //crawling code
      obj.httpstatus = "not found"
   }
}

And my domain class is as follows:

class Crawler extends BasicCrawler {
   String url
   String parentUrl
   String httpstatus
   static constraints = {}
}

But I got the following error:

ERROR crawler.WebCrawler  - Exception while running the visit method. Message: 'No such property: url for class: mypackage.BasicCrawler
Possible solutions: obj' at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.unwrap(ScriptBytecodeAdapter.java:50)

After this I tried another approach. In my src/groovy/BasicCrawler.groovy class, I declared the url and parentUrl properties at the top and then used data binding (I might be wrong since I am just a beginner):

class BasicCrawler extends WebCrawler {
   String url
   String parentUrl

   @Override
   boolean shouldVisit(WebURL url) { //code
   }

   @Override
   void visit(Page page) { //code
   }

   @Override
   protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
      //code
   }
   def bindingMap = [url: url , parentUrl: parentUrl]
   def Crawler = new Crawler(bindingMap)
}

And my Crawler.groovy domain class is as follows:

class Crawler {
   String url
   String parentUrl
   static constraints = {}
}

Now it doesn't show any error, but the values are not being persisted to the database. I am using MongoDB for the backend.

Opal
clever_bassi
  • You could temporarily add some `log.debug()` or `println` calls at various places in the code to see where it works and where it doesn't. – wwarlock Jul 01 '14 at 15:27
  • Can you please post the line number of the exception from your first approach? Your second approach is implemented slightly incorrectly: you bind the parameters as soon as a new instance of **BasicCrawler** is created, so the values of **url** and **parentUrl** will always be null. Change it to bind after your **visit** method has executed. – Shashank Agrawal Jul 27 '14 at 13:39

1 Answer

I think this example is a bit contrived, but here is one way you might solve the problem in your current situation:

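class BasicCrawler extends WebCrawler {
   @Override
   void visit(Page page) {
      // Work with a local domain instance per page instead of one shared field
      String pageUrl = page.getWebURL().getURL()
      Crawler obj = Crawler.findByUrl(pageUrl) ?: new Crawler(url: pageUrl)
      obj.parentUrl = page.getWebURL().getParentUrl()
      obj.save()
   }

   @Override
   protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
      // Look up by the URL string (not the WebURL object); the row may not exist yet
      String pageUrl = webUrl.getURL()
      Crawler obj = Crawler.findByUrl(pageUrl) ?: new Crawler(url: pageUrl)
      obj.httpstatus = "not found"
      obj.save()
   }
}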

The key here is to avoid a member instance variable and instead use the URL to look up and update the record for the page being handled. I'm assuming url will have a unique constraint, so each page maps to exactly one row.
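
To see why the member-variable approach in the question's second attempt never had anything to save, here is a minimal plain-Groovy sketch (no Grails, GORM, or crawler4j involved). The names PageRow, EagerHolder, and LazyHolder are hypothetical stand-ins invented for illustration; the only point is that a field initializer runs once at construction time, before any page has been visited.

```groovy
// Hypothetical stand-in for the domain class; plain Groovy, no GORM.
class PageRow {
    String url
    String parentUrl
}

// Mirrors the second attempt: the map and the row are built by field
// initializers, which run once when the object is constructed.
class EagerHolder {
    String url
    String parentUrl
    def bindingMap = [url: url, parentUrl: parentUrl]   // both values are null here
    def row = new PageRow(bindingMap)

    void visit(String pageUrl, String pageParent) {
        // Updating the fields later does not touch the already-built row
        url = pageUrl
        parentUrl = pageParent
    }
}

// Mirrors the approach above: build the row inside the callback,
// from the values that were just seen.
class LazyHolder {
    List<PageRow> saved = []

    void visit(String pageUrl, String pageParent) {
        saved << new PageRow(url: pageUrl, parentUrl: pageParent)
    }
}

def eager = new EagerHolder()
eager.visit('http://example.com/a', 'http://example.com/')
assert eager.row.url == null                 // captured before visit() ran

def lazy = new LazyHolder()
lazy.visit('http://example.com/a', 'http://example.com/')
assert lazy.saved[0].url == 'http://example.com/a'
```

In the real crawler the list append would be replaced by a save() call on the domain object, and note that in the question's second attempt save() was never called at all, which on its own would explain why nothing reached MongoDB.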

Todd W Crone