crawler4j get full parent list

Question

I'm new to crawler4j. I crawled a website to a certain depth and found what I was searching for. Now I want to trace back my steps and find out how I reached this page: I need a list of the links that led to the page containing the content I was looking for.

My attempt was to change the visit method in the crawler:

@Override
public void visit(Page page) {
  String url = page.getWebURL().getURL();

  // condition for content found
  boolean contentFound = false; 

  // compute 'content found' here

  if (contentFound) {
    System.out.println(page.getWebURL().getParentUrl());
    getMyController().shutdown();
  }
}

This only gives me the parent URL as a String.

page.getWebURL().getParentDocid();

only gets me the document ID of the parent, but how can I find the parent of that page in turn?

Thanks!

score 1 · Accepted Answer · answered Nov 28 '13 at 22:14

Crawler4J does not seem to expose the URLs it has previously visited in a convenient way. The simplest approach is probably to record them yourself as you visit them, in a Map<String,String> that maps each URL to its parent URL:

parentMap.put(url, page.getWebURL().getParentUrl());

Then, to find the full path, you can trace your way back along the map entries one by one, e.g.:

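// Walk from the found page back towards the seed; a null parent ends the loop.
List<String> path = new ArrayList<>();
do {
  path.add(url);
  url = parentMap.get(url);
} while (url != null);
// path is ordered from the found page back to the seed.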
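
Putting the two pieces together, a minimal self-contained sketch might look like the following. The class ParentTrace and its methods record and pathTo are hypothetical names, not part of crawler4j. The sketch assumes that visit() may be called from several crawler threads at once (so the map is a shared ConcurrentHashMap) and that seed pages report a null parent URL. It does not depend on crawler4j itself; in a real crawler you would call record(url, page.getWebURL().getParentUrl()) from visit().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of crawler4j.
class ParentTrace {
    // One map shared by all crawler threads (assumption: visit() runs concurrently).
    private static final Map<String, String> PARENTS = new ConcurrentHashMap<>();

    // Intended to be called from visit() with the page URL and its parent URL.
    static void record(String url, String parentUrl) {
        // ConcurrentHashMap rejects null values, so pages without a parent
        // (assumed here to be the seeds) are simply not stored.
        if (url != null && parentUrl != null) {
            PARENTS.putIfAbsent(url, parentUrl);
        }
    }

    // Returns the chain of URLs from the earliest recorded ancestor down to url.
    static List<String> pathTo(String url) {
        List<String> path = new ArrayList<>();
        // The contains() check stops the walk if the recorded links ever form a loop.
        for (String current = url;
             current != null && !path.contains(current);
             current = PARENTS.get(current)) {
            path.add(current);
        }
        Collections.reverse(path); // seed first, found page last
        return path;
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for what visit() would see during a crawl.
        record("http://example.com/", null);
        record("http://example.com/a", "http://example.com/");
        record("http://example.com/a/b", "http://example.com/a");
        System.out.println(pathTo("http://example.com/a/b"));
    }
}
```

Storing only non-null parents means a seed has no entry, so the lookup for a seed returns null and the walk ends there, just as in the loop above.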
