2

I have created a custom crawler using crawler4j. My app creates a lot of controllers, and after a while the number of threads in the process hits the maximum value and the JVM throws an exception. Even though I call shutdown() on each controller, set it to null, and call System.gc(), the threads in my app stay alive and the app eventually crashes.

I used Java VisualVM (jvisualvm.exe) and saw that at one point my app reaches 931 threads.
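For reference, the live-thread figure that VisualVM shows can also be sampled from inside the application, which makes it easier to check whether a shutdown call actually released anything. The following is only a minimal sketch using the JDK's `ThreadMXBean`; the class name `ThreadCountProbe` and its helper methods are made up for illustration, and the idle worker threads merely stand in for whatever threads a crawler controller would start.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper, not part of crawler4j.
class ThreadCountProbe {

    /** Live threads (daemon and non-daemon) currently in this JVM. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Starts n idle threads and returns the live-thread count while they are parked. */
    static int countWhileRunning(int n) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Thread(() -> {
                try {
                    release.await(); // park until the probe has sampled
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        int during = liveThreads();
        release.countDown(); // let the workers finish
        for (Thread t : workers) {
            t.join();
        }
        return during;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("baseline:       " + liveThreads());
        System.out.println("with 5 workers: " + countWhileRunning(5));
        System.out.println("after join:     " + liveThreads());
    }
}
```

Logging `liveThreads()` before creating a controller and again after shutting it down would show whether the count returns to its baseline.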

Is there a way I can immediately kill all the threads created by the CrawlController object of the crawler4j project? (or any other object for that matter)

Alireza Noori
  • Do you have control over the run() method of the Threads? Can you show us? Sounds to me that the Threads don't die. – Fildor Feb 01 '13 at 10:56
  • I use the .jar file of the crawler4j class. However if I can't find a simple way to do this, I can access the source code of the crawler4j. I want to stop crawler4j's controller's threads. – Alireza Noori Feb 01 '13 at 11:00
  • From the crawler4j homepage: "You should also implement a controller class which specifies the seeds of the crawl, the folder in which intermediate crawl data should be stored **and number of concurrent threads**." – Fildor Feb 01 '13 at 11:15
  • I am doing that, but I want to create a controller which can shut down the threads created by crawler4j. – Alireza Noori Feb 01 '13 at 11:17
  • I just had a look at the code ... Each crawler controller seems to have a MonitorThread ... didn't see at first glance how to achieve that. – Fildor Feb 01 '13 at 11:22

3 Answers

2

I just spent 2 hours struggling with the exact same problem and finally found the source of the bug. If you create a controller and don't start it, shutdown() won't kill any of the threads it created. Instead, you have to use the following:

controller.shutdown();
controller.getPageFetcher().shutdown();

where controller is your instance of CrawlController.
I also raised this as an issue on the crawler4j project page, and it looks like this will be fixed by the release of version 3.6.

Ephraim
  • I don't have the code right now to test whether it works for me or not, but it seems like it's going to. Nonetheless, I'm going to mark this as answer. Thanks for sharing your solution. – Alireza Noori Aug 26 '14 at 05:47
2

Ephraim is correct. There are two issues in Crawler4j:

  1. The Environment object in CrawlController is not closed.
  2. The PageFetcher object in CrawlController is not closed.

https://code.google.com/r/yonid-crawler4j/

I have done my best to create a version that shuts down properly after a start (startNonBlocking), and that also has a forceShutdown for cases where you create a controller and never call a start function.

LarsTech
Jonathan Druck
0

shutdown() politely asks the threads to finish their jobs and only shuts down afterwards. But what if the threads run endless tasks and never finish? Have you tried shutdownNow()? It interrupts running tasks before they are finished, so threads whose tasks respond to interruption stop right away.
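To illustrate the difference, here is a minimal sketch against the JDK's `java.util.concurrent.ExecutorService`, which is where `shutdown()` and `shutdownNow()` are defined. It is not crawler4j code, and whether a CrawlController exposes anything equivalent is an assumption I have not verified. The class name `ShutdownDemo` and the "endless" task are invented for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: a plain JDK thread pool, not a crawler4j controller.
class ShutdownDemo {

    /** Returns {terminated after shutdown(), terminated after shutdownNow()}. */
    static boolean[] run() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // An "endless" task that only stops when its thread is interrupted.
        pool.execute(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag so the loop exits
                }
            }
        });

        // shutdown(): no new tasks are accepted, but the running task is left alone.
        pool.shutdown();
        boolean afterShutdown = pool.awaitTermination(300, TimeUnit.MILLISECONDS);

        // shutdownNow(): additionally interrupts the running task.
        pool.shutdownNow();
        boolean afterShutdownNow = pool.awaitTermination(2, TimeUnit.SECONDS);

        return new boolean[] { afterShutdown, afterShutdownNow };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("terminated after shutdown():    " + r[0]);
        System.out.println("terminated after shutdownNow(): " + r[1]);
    }
}
```

Note that `shutdownNow()` is best effort: it works here only because the task checks its interrupt status. A task that ignores interruption keeps its thread alive either way.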

Simulant