Can't crawl RDF Data with Apache Nutch

Question

I am trying to crawl DBpedia with Apache Nutch 1.15, but I'm having problems parsing RDF files.

In the parsing phase, I only get this message:

**apache_nutch | Error parsing: http://dbpedia.org/data/Moscow.xml: failed(2,0): Can't retrieve Tika parser for mime-type application/rdf+xml**

Following this reference, I configured my parse-plugins.xml to handle application/rdf+xml like this:

<mimeType name="application/rdf+xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
</mimeType>

But still, the message persists.

Even when I use Any23, mapping the parse filter as

<alias name="any23-parserFilter"
        extension-id="Any23Parser" />

and setting the parsers for the mime type as:

<mimeType name="application/rdf+xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
</mimeType>

The message still persists.

What am I missing here?

I can't really help here with the Nutch issue, but I'm wondering why do you need to crawl DBpedia? — UninformedUser, Sep 25 '19 at 06:45
@AKSW, i just used DBpedia as an example. I have no intention of crawling since the dumps are available. Actually, i just want to crawl RDF data from other sources. — gsjunior86, Sep 25 '19 at 09:13

score 2 · Accepted Answer · answered Oct 01 '19 at 09:57

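The Nutch any23 plugin targets RDF embedded in HTML (RDFa) and Microdata. Technically, it only implements HtmlParseFilter, which runs only after the document has been successfully parsed by a Parser implementation. In your case no parser accepts application/rdf+xml, so the filter never gets a chance to run on standalone RDF/XML files.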
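To make the ordering concrete, here is a toy model in Python of a parse chain where filters run only after a parser has succeeded. It is not Nutch code: `run_parse_chain`, `html_parser` and `triple_filter` are hypothetical names chosen for illustration, and the real plugin wiring in Nutch is more involved.

```python
# Toy model only: these names are hypothetical and are NOT Nutch's API.
from typing import Callable, Dict, List, Optional

Parser = Callable[[str], dict]
ParseFilter = Callable[[str, dict], dict]


def run_parse_chain(
    mime_type: str,
    content: str,
    parsers: Dict[str, Parser],
    filters: List[ParseFilter],
) -> Optional[dict]:
    """Run the parser registered for mime_type, then every filter."""
    parser = parsers.get(mime_type)
    if parser is None:
        # No parser for this mime type: the chain stops here,
        # so no filter is ever called.
        return None
    result = parser(content)
    for parse_filter in filters:
        result = parse_filter(content, result)
    return result


def html_parser(content: str) -> dict:
    # Stand-in for an HTML parser: returns an empty parse result.
    return {"text": content, "triples": []}


def triple_filter(content: str, result: dict) -> dict:
    # Stand-in for a filter that enriches an existing parse result.
    result["triples"].append("filter ran")
    return result


parsers = {"text/html": html_parser}
filters = [triple_filter]

html_result = run_parse_chain("text/html", "<html></html>", parsers, filters)
rdf_result = run_parse_chain("application/rdf+xml", "<rdf:RDF/>", parsers, filters)

print(html_result)
print(rdf_result)
```

In this sketch the HTML document reaches the filter, while the RDF/XML document stops at the missing parser, which is the same shape as the failure in the question.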

To extract RDFa, try this and you should see many extracted triples:

> bin/nutch parsechecker \
   -Dany23.extractors=html-microdata,html-rdfa11 \
   -Dplugin.includes='protocol-http|parse-html|any23' \
  https://schema.org/NewsArticle
...
Any23-Triples=<https://schema.org/NewsArticle> <http://www.w3.org/ns/rdfa#usesVocabulary> <http://schema.org/> .
...
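If you want to post-process that output, the following is a minimal Python sketch that splits one `Any23-Triples=` line into subject, predicate and object. It assumes one triple per line with exactly that prefix, as in the output above, and it only handles triples made of three IRIs; literals and blank nodes are deliberately not covered. The helper name `split_triple` is made up for this example.

```python
import re

# Matches a triple whose three terms are all IRIs in angle brackets.
# Assumption: literals and blank nodes are out of scope for this sketch.
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.$")

PREFIX = "Any23-Triples="


def split_triple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    line = line.strip()
    if line.startswith(PREFIX):
        line = line[len(PREFIX):]
    match = TRIPLE_RE.match(line.strip())
    return match.groups() if match else None


line = (
    "Any23-Triples=<https://schema.org/NewsArticle> "
    "<http://www.w3.org/ns/rdfa#usesVocabulary> <http://schema.org/> ."
)
print(split_triple(line))
```

For anything beyond a quick check, a real RDF library would be a safer choice than a regular expression.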
