
I have a problem loading a list of links; these links should be passed to controller.addSeed in a loop. Here is the code:

SelectorString selector = new SelectorString();
List<String> lista = selector.leggiFile(); // load the list of seed URLs from file
String crawlStorageFolder = "/home/usersstage/Desktop/prova";
for (String x : lista) {
    System.out.println(x);
    System.out.println("----");
}

// numberOfCrawlers is the number of threads started for the crawl

int numberOfCrawlers = 2; // threads
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);

// Do not send more than one request per second (1000 ms || 200 ms?)
config.setPolitenessDelay(200);

// Crawl depth; -1 means unlimited
config.setMaxDepthOfCrawling(-1);

// Maximum number of pages to fetch; -1 means unlimited
config.setMaxPagesToFetch(-1);

config.setResumableCrawling(false);

// Instantiate the controller for this crawl
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig,
        pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher,
        robotstxtServer);
// LOOP used to add several websites (more than 100)
for (int i = 0; i < lista.size(); i++) {
    controller.addSeed(lista.get(i));
}
controller.start(Crawler.class, numberOfCrawlers);

I need to crawl these sites and retrieve only RSS pages, but the output of the crawled list is empty.

  • I had to wait about 10 minutes and then it started to crawl... how is that possible? – Justin Aug 05 '14 at 12:35
  • Did you solve it? If so, can you help with this: http://stackoverflow.com/questions/30323522/calling-controllercrawler4j-3-5-inside-loop – Selva May 19 '15 at 11:08

2 Answers


The code you posted shows how to configure the CrawlController, but you also need to configure the Crawler if you only want to crawl RSS resources. That logic belongs in the shouldVisit method of your crawler. Check this example.
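
For instance, a minimal sketch of such a crawler (the regex patterns are assumptions you would adapt to your sites, and the single-argument shouldVisit signature of crawler4j 3.x is assumed; newer versions use shouldVisit(Page referringPage, WebURL url)): shouldVisit skips obvious binary resources so HTML pages keep being followed, while visit reports the pages whose URL looks like an RSS feed.

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

import java.util.regex.Pattern;

public class Crawler extends WebCrawler {

    // Skip obvious binary/static resources so the crawler keeps following HTML pages
    private static final Pattern EXCLUDED =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|ico|pdf|zip)$");

    // URLs that look like RSS feeds (assumed pattern; adapt it to your sites)
    private static final Pattern RSS =
            Pattern.compile(".*(rss|feed).*", Pattern.CASE_INSENSITIVE);

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !EXCLUDED.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String href = page.getWebURL().getURL();
        if (RSS.matcher(href).matches()) {
            System.out.println("RSS page found: " + href);
        }
    }
}

The class name Crawler matches the one passed to controller.start in your code, so this is the place to tighten or loosen the filtering if the crawl still comes back empty.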

– Răzvan Petruescu

Try the code below, and also check the shouldVisit method in your crawler class.

for (int i = 0; i < lista.size(); i++) {
    controller.addSeed(lista.get(i));
    controller.start(Crawler.class, numberOfCrawlers);
}
– Rahul