
I have to implement a web crawler that visits Linked Data on the Web. I have built simple functionality for that, and I have three questions:

  1. What seed URIs should I use, i.e. which web sites provide data in RDF format and follow Tim Berners-Lee's Linked Data principles?
  2. Generally, what is meant by a round-based approach for web crawlers? I read about general web crawlers and found that a round-based approach should be followed.
  3. I am able to parse only web pages that return RDF/XML data. Is that sufficient to crawl Linked Data?
Kara
Prannoy Mittal

1 Answer

  1. There are a couple of options: for example, use all the URIs found in the Billion Triples Challenge dump as starting points, or all the resources listed in the lodcloud group on the Data Hub (these can be retrieved through the CKAN API).
  2. Sorry, I don't know.
  3. No, RDF/XML is not sufficient, as many datasets published as Linked Data use other formats; you also want Turtle and RDFa. You can use Apache Any23, which understands all of the above. LDSpider is a crawler that uses Any23.
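Not part of the answer above, but one common reading of "round-based" (my assumption, not the answerer's) is breadth-first crawling: each round dereferences every URI discovered in the previous round exactly once. A minimal Python sketch, with a stubbed `fetch_links` standing in for an HTTP GET plus RDF parsing (the `ACCEPT` header lists the formats mentioned above):

```python
from collections import deque  # not strictly needed here; lists suffice per round

# Accept header covering the formats mentioned in point 3
# (RDF/XML, Turtle, and RDFa embedded in XHTML).
ACCEPT = "application/rdf+xml, text/turtle;q=0.9, application/xhtml+xml;q=0.8"

def crawl_rounds(seeds, fetch_links, max_rounds=3):
    """Round-based (breadth-first) crawl: in each round, dereference every
    URI discovered in the previous round, collecting newly seen URIs for
    the next round. Returns the list of rounds actually crawled."""
    seen = set(seeds)
    frontier = list(seeds)
    rounds = []
    for _ in range(max_rounds):
        if not frontier:
            break
        rounds.append(list(frontier))
        next_frontier = []
        for uri in frontier:
            # In a real crawler, fetch_links(uri) would issue an HTTP GET
            # with the ACCEPT header, parse the returned RDF (e.g. via
            # Any23), and yield the URIs of linked resources.
            for link in fetch_links(uri):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return rounds

# Usage with a toy link graph (hypothetical URIs, for illustration only):
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["e"], "e": []}
rounds = crawl_rounds(["a"], lambda uri: graph.get(uri, []))
```

With `max_rounds=3`, the crawl above stops after three rounds even though "e" is still undiscovered; the round limit is how such crawlers bound how far they wander from the seeds.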
cygri