This question may sound stupid, but I spent hours researching a solution and couldn't find one, so if anyone knows, that would be great!

I successfully read an ARC file (from the Common Crawl dataset). With arcHeader.getUrl(); I get all the URLs. However, I don't understand whether the 'outgoing' links from a particular URL are stored there and, if they are, how to get them.

[PS] By 'outgoing', I mean the URLs that a page itself contains, e.g. in ads, content, etc. Does the Common Crawl ARC file contain them, and if so, how do I get them?

Thanks in advance!

EDIT: I solved this: I read the HTML content of each record and got all the links from it. It wasn't that difficult!
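For anyone landing here later, a minimal sketch of the "read the HTML and pull out the links" step might look like the following. It assumes you already have the record body as a String and the record URL from arcHeader.getUrl(); how the body is read from the ARC record is not shown. The class name OutgoingLinks and the method extract are made up for illustration, and the regex is a crude stand-in for a real HTML parser (it only sees quoted href/src attributes).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix or Common Crawl.
public class OutgoingLinks {
    // Matches quoted href="..." / src='...' attribute values.
    // Crude on purpose: unquoted attributes and script-generated links are missed.
    private static final Pattern LINK = Pattern.compile(
            "(?:href|src)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // html:    page content taken from the ARC record (assumed already decoded)
    // baseUrl: the record URL, e.g. the value returned by arcHeader.getUrl()
    public static List<String> extract(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String raw = m.group(2).trim();
            // Skip empty values, in-page anchors and javascript: pseudo-links.
            if (raw.isEmpty() || raw.startsWith("#")
                    || raw.toLowerCase().startsWith("javascript:")) {
                continue;
            }
            try {
                // Turn relative links into absolute ones against the page URL.
                links.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // Malformed URL in the page: ignore it and keep going.
            }
        }
        return links;
    }
}
```

Calling OutgoingLinks.extract(html, arcHeader.getUrl()) for each record would then give the absolute URLs found in that page. For messy real-world HTML, a proper parser is likely a safer choice than a regex.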

code muncher
If you are still working on this and still need help, you should ask this question on the official Heritrix mailing list: http://tech.groups.yahoo.com/group/archive-crawler/ – Nielsvh Jul 23 '13 at 16:35

0 Answers