DBPedia (de) data with JENA: character encoding errors ("not unicode")

Question

I am trying to access DBpedia (de) data on my local machine. Having downloaded and unzipped some .ttl files, I tried to test a very simple SPARQL query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?s 
WHERE
{
 ?s rdf:type skos:Concept .
 ?s rdfs:label ?label .
}
LIMIT 100

using this ARQ command (on Windows):

arq --data dewiki-20140813-article-categories.ttl --query dbpedia_cat.rq

I did not expect anything to go wrong, but instead I got a bunch of errors like these:

19:29:02 WARN  riot                 :: [line: 2860693, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
19:29:02 WARN  riot                 :: [line: 2860693, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
19:29:02 WARN  riot                 :: [line: 2860694, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
19:29:02 WARN  riot                 :: [line: 2860694, col: 1 ] Bad IRI: <http://de.dbpedia.org/resource/à_Baby_One_More_Time> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO

After those errors, ARQ added the following:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:170)
        at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:86)
        at org.apache.jena.atlas.iterator.PeekIterator.fill(PeekIterator.java:50)
        at org.apache.jena.atlas.iterator.PeekIterator.next(PeekIterator.java:92)
        at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:99)
        at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:287)
        at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:269)
        at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:250)
        at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:191)
        at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:44)
        at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:90)
        at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
        at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
        at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
        at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
        at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:534)
        at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:501)
        at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:454)
        at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:432)
        at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:422)
        at arq.cmdline.ModDatasetGeneral.addGraphs(ModDatasetGeneral.java:101)
        at arq.cmdline.ModDatasetGeneral.createDataset(ModDatasetGeneral.java:90)
        at arq.cmdline.ModDatasetGeneralAssembler.createDataset(ModDatasetGeneralAssembler.java:35)
        at arq.cmdline.ModDataset.getDataset(ModDataset.java:34)
        at arq.query.getDataset(query.java:176)
        at arq.query.queryExec(query.java:198)
        at arq.query.exec(query.java:159)
        at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
        at arq.arq.main(arq.java:28)

Having tested two unpacking utilities (Ark on Linux and WinRAR on Windows), I'm quite sure that unzipping is not the problem here.

I have also looked at the .ttl files with Notepad++, and all characters seem right to me, even the problematic ones like Ä, Ö, Ü etc.

So, I have no idea how to cope with those errors and would appreciate any help!

(Apologies for asking a question which is not 100% programming related. But I don't know whether JENA or DBPedia is the problem here and thus which mailing list would be appropriate. However, it is a beginner's question anyway, so I hope someone here can help.)

What are you looking for? You are extracting IRI and you need to store string? — Artemis, Mar 24 '15 at 00:19
Well, with sample data from "Learning SPARQL" (by Bob DuCharme) ARQ doesn't report any errors, so I guess something is wrong with my approach or the data itself. I just want to understand what is happening or (even better) get rid of those errors before proceeding to any serious query. — cis, Mar 24 '15 at 05:40
You didn't answer me. What do you want to store? You are now storing IRI, but the system doesn't like them. So what is it you need? — Artemis, Mar 24 '15 at 07:09
(Sorry, just wanted to say that the example query is just a mock up.) In the end I want to have a list of persons meeting certain criteria with their URI, their label, their categories and their date of death. BTW: If I try to load the data to TDB via "tdbloader" I get the very same errors. So, I don't think the query is the matter here. — cis, Mar 24 '15 at 07:18
Correct, the query is sound, but what the error suggests is that it doesn't like the result coming out of the query. My guess is you need to replace ?s with ?label. — Artemis, Mar 24 '15 at 07:36
Thanks for that suggestion. However, the outcome is the same. — cis, Mar 24 '15 at 09:42

score 1 · Accepted Answer · answered Mar 24 '15 at 10:16

The WARN lines are just that: warnings, not errors. The data is valid UTF-8, but the reported IRIs contain characters that are not in the Unicode normal form (NFKC) preferred by the W3C standards.

This

--data dewiki-20140813-article-categories.ttl

loads all the data into memory, hence you run out of space. Either load the data into a database such as TDB or, if the file looks like it might fit in memory on your machine, increase the Java heap size.
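To see what the NOT_NFKC and COMPATIBILITY_CHARACTER warnings are complaining about, here is a minimal Python sketch (not part of Jena) that checks a string against NFKC. It rests on an assumption: that the garbled "à" in the console output is really U+2026 HORIZONTAL ELLIPSIS (as in the title "…Baby One More Time"), mangled by the Windows console code page. The helper names `is_nfkc` and `non_nfkc_chars` are made up for illustration.

```python
import unicodedata


def is_nfkc(text: str) -> bool:
    """Return True if text is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", text) == text


def non_nfkc_chars(text: str):
    """List single characters that NFKC would rewrite.

    This is a per-character approximation: it catches compatibility
    characters, but not issues that only arise from sequences of
    combining marks.
    """
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
        for ch in text
        if unicodedata.normalize("NFKC", ch) != ch
    ]


# Assumption: the character shown as "à" in the log is U+2026.
iri = "http://de.dbpedia.org/resource/\u2026_Baby_One_More_Time"

print(is_nfkc(iri))          # the ellipsis is a compatibility character
print(non_nfkc_chars(iri))   # shows which character triggers the warning

# Precomposed umlauts such as U+00C4 (Ä) are already in NFKC.
print(is_nfkc("http://de.dbpedia.org/resource/\u00c4rzte"))
```

Under that assumption, umlauts like Ä, Ö and Ü are not what triggers the warnings, which would match the observation that they look fine in Notepad++. Only compatibility characters such as the ellipsis are flagged, and the triples are still parsed.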
