crawler4j asynchronously saving results to file

Question

I'm evaluating crawler4j for ~1M crawls per day. My scenario is this: I fetch each URL and parse its description, keywords, and title. Now I would like to save each URL and its words into a single file.

I've seen how it's possible to save crawled data to files. However, since I have many crawls to perform, I want separate threads to handle the file writes so that the fetcher thread is not blocked. Is that possible with crawler4j? If so, how?

Thanks

Consider using a `Queue` where you put the data to be written and which are then processed by one/more worker `Thread`s (this approach is nothing `crawler4j`-specific). Search for "producer consumer" to get some general ideas. — qqilihq, Feb 14 '16 at 15:11
@qqilihq how do you share the queue with the crawler? I don't instantiate the crawler myself — Gideon, Feb 14 '16 at 15:12
Not sure I understand the problem. Code sample would help … — qqilihq, Feb 14 '16 at 15:13
You create the crawler using this: `controller.start(MyCrawler.class, numberOfCrawlers);` which means the MyCrawler is getting instantiated by the controller, if I do this how can I share the queue? I can probably make it static (and therefore global) but that's usually a bad idea — Gideon, Feb 14 '16 at 15:15

score 1 · Accepted Answer · answered Feb 14 '16 at 15:29

Consider using a `Queue` (`BlockingQueue` or similar): the crawlers put the data to be written into it, and one or more worker `Thread`s take the items out and write them. This approach is not specific to crawler4j. Search for "producer consumer" to get some general ideas.
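To make the consumer side concrete, here is a minimal sketch using only the JDK. `CrawlResult`, `ResultWriter`, and the `POISON` shutdown marker are hypothetical names chosen for illustration, not part of crawler4j, and the tab-separated line format is an assumption.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;

// Hypothetical value type: one crawled URL plus the words extracted from it.
final class CrawlResult {
    final String url;
    final String words;

    CrawlResult(String url, String words) {
        this.url = url;
        this.words = words;
    }
}

// Hypothetical consumer: drains the queue and appends one line per result.
final class ResultWriter implements Runnable {
    // Sentinel ("poison pill") that tells the writer to stop; compared by identity.
    static final CrawlResult POISON = new CrawlResult("", "");

    private final BlockingQueue<CrawlResult> queue;
    private final Path target;

    ResultWriter(BlockingQueue<CrawlResult> queue, Path target) {
        this.queue = queue;
        this.target = target;
    }

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (true) {
                CrawlResult result = queue.take(); // blocks until an item is available
                if (result == POISON) {
                    break; // producers are done; closing the writer flushes the file
                }
                out.write(result.url + "\t" + result.words);
                out.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and exit
        }
    }
}
```

The producers (the crawler threads) would call `queue.put(new CrawlResult(url, words))`, and whoever started the crawl would put `ResultWriter.POISON` once crawling has finished. With a bounded queue such as `new LinkedBlockingQueue<>(10_000)`, `put` blocks when the writer falls behind, so memory use stays capped at the cost of briefly stalling the crawlers.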

Concerning your follow-up question on how to pass the `Queue` to the crawler instances, this should do the trick (based only on reading the source code; I haven't used crawler4j myself):

```java
final BlockingQueue<Data> queue = …

// use a factory, instead of supplying the crawler type, to pass the queue
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(queue);
    }
}, numberOfCrawlers);
```

I totally missed that factory. Thanks! — Gideon, Feb 14 '16 at 16:43
