
I'm trying to set up a local copy of DBpedia for an experiment.

I downloaded the latest DBpedia dataset from http://downloads.dbpedia.org/2015-10/core/ and stored the files in a directory, dbp_201510/.

I tried to load the dataset using tdbloader2.

tdbloader2 --loc tdb dbp_201510/*

However, I receive the following error.

ERROR [line: 2, col: 145] Illegal character in IRI (codepoint 0x60, '`'): <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/[`]...>
org.apache.jena.riot.RiotException: [line: 2, col: 145] Illegal character in IRI (codepoint 0x60, '`'): <http://www4.wiwiss.fu-berlin.de/gutendata/resource/people/[`]...>
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:165)
at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:108)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:71)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:58)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:176)
at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:861)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:667)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:637)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:626)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:617)
at org.apache.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:165)
at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at org.apache.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:85)

In addition, I receive a lot of warnings like the ones below.

WARN  [line: 92881, col: 1 ] Bad IRI: <http://dbpedia.org/resource/Ranma_½> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
WARN  [line: 92882, col: 1 ] Bad IRI: <http://dbpedia.org/resource/Ranma_½> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.

I use Apache Jena 3.0.1.

I'm looking for a way to avoid this error. Also, is there a good way to load the data without these warnings?

I did the same thing with the previous version of DBpedia (http://downloads.dbpedia.org/2015-04/core/), and loading completed successfully without any warnings or errors.

Benben

1 Answer


The data should be made legal before loading. The character 0x60, '`', is not legal in a URI. Maybe you want to replace it with %60 (it is then a different URI).
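As a rough illustration of the %60 replacement, here is a minimal Python sketch. It assumes the dump has already been decompressed to line-based N-Triples; the function names and any file paths passed to them are hypothetical. Only backticks inside <...> IRI tokens are rewritten, and quoted literals are left alone.

```python
import re

# Matches either a quoted literal (kept as-is) or an <...> IRI token.
# Assumption: the input is line-based N-Triples, one statement per line.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<[^>]*>')


def escape_backticks(line: str) -> str:
    """Percent-encode backticks inside IRI tokens of one N-Triples line."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("<"):
            # IRI token: '`' (0x60) is illegal here, so write it as %60.
            return token.replace("`", "%60")
        # Quoted literal: backticks are legal text, leave them untouched.
        return token

    return TOKEN.sub(fix, line)


def clean_file(src: str, dst: str) -> None:
    """Copy src to dst, escaping backticks in IRIs (hypothetical helper)."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(escape_backticks(line))
```

Keep in mind that the rewritten IRIs are different IRIs from the originals, so anything that refers to the old form will no longer match.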

In many large datasets, the data isn't perfect. It is worth checking it before loading using "riot --validate".

The warnings are just warnings, not errors. They indicate that the IRI text is not in the standard's preferred Unicode form (NFKC), which might cause matching problems later. It looks like ½ can be written in different ways in Unicode.
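To see why ½ triggers the NOT_NFKC warning, here is a small Python sketch using the standard unicodedata module. It only illustrates the Unicode normalization involved and is not a description of how Jena performs its check; the helper name is hypothetical.

```python
import unicodedata


def is_nfkc(text: str) -> bool:
    """Return True if text is already in Unicode Normal Form KC."""
    return unicodedata.normalize("NFKC", text) == text


iri = "http://dbpedia.org/resource/Ranma_\u00bd"  # ends with the single character ½
normalized = unicodedata.normalize("NFKC", iri)

print(is_nfkc(iri))         # False: ½ is a compatibility character
print(is_nfkc(normalized))  # True
# NFKC rewrites ½ as three code points: '1', FRACTION SLASH (U+2044), '2'.
print([hex(ord(c)) for c in normalized[-3:]])
```

The two strings look similar but are different sequences of code points, which is why comparing such IRIs can fail to match.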

(I'm sure the DBpedia team would appreciate some feedback.)

AndyS