Reading from arc file (commoncrawl dataset) with ARCReader

Asked Nov 15 '12 at 21:52

Active Nov 16 '12 at 21:37

Viewed 340 times

This question may sound stupid, but I spent hours researching it and couldn't find a solution, so if anyone knows, that would be great!

I can successfully read an ARC file (from the Common Crawl dataset) with ARCReader. With arcHeader.getUrl() I get the URL of every record. However, I don't understand whether the 'outgoing' links of a particular URL are stored in the file and, if they are, how to get them.

[PS] By 'outgoing' I mean the URLs that the page itself contains, for example in ads or in its content. Does the Common Crawl ARC file contain those and, if so, how do I get them?

Thanks in advance!

EDIT: I solved this: I read the HTML content of each record and extracted all the links from it. It wasn't that difficult!
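
The approach mentioned in the edit might look roughly like the sketch below. It assumes the HTML body of a record has already been read into a String and that the record's URL (for example from arcHeader.getUrl()) is available as the base for relative links. The class name OutgoingLinks and the method extractLinks are hypothetical, and the regular expression is only a rough heuristic for href/src attributes, not a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Heritrix or Common Crawl.
final class OutgoingLinks {
    // Matches quoted href="..." or src='...' attribute values (rough heuristic).
    private static final Pattern LINK_ATTR = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
        Pattern.CASE_INSENSITIVE);

    // pageUrl: URL of the record; html: body of the record (assumed already decoded).
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            String raw = (m.group(1) != null ? m.group(1) : m.group(2)).trim();
            // Skip empty values, in-page anchors and script pseudo-links.
            if (raw.isEmpty() || raw.startsWith("#") || raw.startsWith("javascript:")) {
                continue;
            }
            try {
                // Resolve relative links against the page URL.
                links.add(base.resolve(raw).toString());
            } catch (IllegalArgumentException e) {
                // Malformed link in the page; ignore it in this sketch.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> "
            + "<img src='http://ads.example.net/banner.png'>";
        for (String link : extractLinks("http://example.com/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

For real crawl data a proper HTML parser would likely be more robust than a regular expression, since pages often contain unquoted attributes or broken markup.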

code muncher


If you are still working on this and still need help, you should ask this question on the official Heritrix mailing list: http://tech.groups.yahoo.com/group/archive-crawler/ – Nielsvh Jul 23 '13 at 16:35

0 Answers