
I'm new to crawler4j. I crawled a website to a certain depth and found the content I was searching for. Now I want to trace my steps back and find out how I reached that page: I need the list of links that led to the page containing the content.

My attempt was to change the visit method in the crawler:

@Override
public void visit(Page page) {
  String url = page.getWebURL().getURL();

  // condition for content found
  boolean contentFound = false; 

  // compute 'content found' here

  if (contentFound) {
    System.out.println(page.getWebURL().getParentUrl());
    getMyController().shutdown();
  }
}

This only gives me the parent URL as a String. Similarly,

page.getWebURL().getParentDocid();

only gives me the document ID of the parent. How can I find the parent of that parent page, and so on back to the start?

Thanks!

Elliott Frisch
IDontKnow
1 Answer


Crawler4J does not seem to make the URLs it has previously visited available in a convenient way. The best thing to do is probably to store them yourself as you visit them in a Map<String,String> from URLs to parents:

parentMap.put(url, page.getWebURL().getParentUrl());

Then, to find the full path, you can trace your way back along the map entries one by one, e.g.:

List<String> path = new ArrayList<String>();
do {
  path.add(url);
  url = parentMap.get(url);
} while(url != null);
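
Putting the two snippets together, here is a minimal sketch of a small helper class. The class name ParentTracker and its methods record and pathTo are hypothetical and not part of crawler4j. It assumes that getParentUrl() returns null for seed pages and that a single tracker instance is shared by all crawler threads (for example through a static field), which is why it uses a ConcurrentHashMap. In visit() you would call something like tracker.record(url, page.getWebURL().getParentUrl()).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of crawler4j: remembers where each URL was found.
final class ParentTracker {
    // Maps a child URL to the URL of the page on which it was discovered.
    private final Map<String, String> parentMap = new ConcurrentHashMap<>();

    // Record a visited URL and its parent. ConcurrentHashMap rejects null
    // values, so pages without a parent (assumed: seeds) are simply not stored.
    void record(String url, String parentUrl) {
        if (url != null && parentUrl != null) {
            // Keep the first parent seen for a URL.
            parentMap.putIfAbsent(url, parentUrl);
        }
    }

    // Walk the parent links back from the given URL and return the path
    // ordered from the earliest known ancestor down to the URL itself.
    List<String> pathTo(String url) {
        List<String> path = new ArrayList<>();
        String current = url;
        // The contains() check guards against an accidental cycle in the map.
        while (current != null && !path.contains(current)) {
            path.add(current);
            current = parentMap.get(current);
        }
        Collections.reverse(path);
        return path;
    }
}
```

Once the content is found, calling pathTo(url) before shutting the controller down should give the chain of links. The path is reversed here so that it reads in crawl order; drop the Collections.reverse call if you prefer the page-to-seed order of the loop above.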
Robin Green