
I downloaded an N-Triples file from DBpedia, but when I tried to read it into a Jena model, an exception was thrown. Below is part of the file:

<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u9AD4\u91CD"@zh .
<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u8EAB\u9AD8"@zh .
<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u8840\u578B"@zh .
<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u8A9E\u8A00"@zh .

The exception thrown is:

Exception in thread "main" com.hp.hpl.jena.shared.InvalidPropertyURIException: http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.splitTag(BaseXMLWriter.java:393)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.startElementTag(BaseXMLWriter.java:368)
    at com.hp.hpl.jena.xmloutput.impl.Unparser$3.wTypeStart(Unparser.java:671)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyEltValueString(Unparser.java:488)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyEltValue(Unparser.java:473)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyElt(Unparser.java:339)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyEltStar(Unparser.java:811)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wTypedNodeOrDescriptionLong(Unparser.java:797)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wTypedNodeOrDescription(Unparser.java:727)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wDescription(Unparser.java:686)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wObj(Unparser.java:642)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wObjStar(Unparser.java:317)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.wRDF(Unparser.java:298)
    at com.hp.hpl.jena.xmloutput.impl.Unparser.write(Unparser.java:200)
    at com.hp.hpl.jena.xmloutput.impl.Abbreviated.writeBody(Abbreviated.java:143)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.writeXMLBody(BaseXMLWriter.java:500)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:472)
    at com.hp.hpl.jena.xmloutput.impl.Abbreviated.write(Abbreviated.java:128)
    at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:458)
    at com.hp.hpl.jena.rdf.model.impl.ModelCom.write(ModelCom.java:277)
    at jena.ReadRDF.main(ReadRDF.java:45)
Java Result: 1

The problem is caused by "%E8%97%9D%E4%BA%BA". When I use URIref.decode() to decode a URI containing this string, "%E8%97%9D%E4%BA%BA" turns out to represent two Chinese characters.
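To illustrate what that escaped segment stands for, here is a minimal sketch using only the standard library's java.net.URLDecoder rather than Jena's URIref. The class and method names (PercentDecodeDemo, decodeSegment) are made up for this example, and it assumes the escapes are UTF-8 bytes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, not part of Jena.
class PercentDecodeDemo {

    // Decodes the percent-escapes in one URI segment, assuming UTF-8.
    // URLDecoder also turns '+' into a space; that does not affect this input.
    static String decodeSegment(String segment) {
        try {
            return URLDecoder.decode(segment, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decoded = decodeSegment("%E8%97%9D%E4%BA%BA");
        System.out.println(decoded.length());   // prints 2
        for (char c : decoded.toCharArray()) {
            // prints U+85DD, then U+4EBA
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The six escaped bytes are two three-byte UTF-8 sequences, so the segment decodes to two characters.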

But when I use Sesame to read the same N-Triples file, it works without any problem.

My questions are: is there any way to solve this problem in Jena, and why did DBpedia choose N-Triples as its default RDF syntax? It works badly with non-ASCII languages.

Also, if I want to publish my RDF data as Linked Data, but the resource URIs contain Chinese and Japanese characters, should I decode the URIs first?

Wang Ruiqi
1 Answer


Well, your question isn't completely clear because you asked about "reading in a Jena model" but the stacktrace you quoted actually starts with a call to the writer.

Jena, in general, tries very hard to conform to the relevant RDF recommendations from W3C and IETF. In particular, it tries not to generate any URIs that do not conform to the rules for valid URIs. This is compounded in the case of writing XML, because most RDF identifiers are not legal XML element names, meaning that you have to split the URI somewhere and use XML namespaces to make the full identifier. Not all RDF toolkits are as particular as Jena is about conforming to some of the rules in the standards.

Things you can try:

  • do you need to call Model.write() as part of your loading process? You should be able to load and process a model without the check for legal URIs being invoked.

  • try writing the output using Turtle format, rather than XML. Turtle doesn't have the same restrictions as XML, and it's a heck of a lot easier for humans to read as well.

  • if there are particular ill-formed URIs in the data you are loading, look to see if there is a newer version of the data. Illegal URIs in DBpedia have been an issue in the past. If the illegal URIs are still there in the latest version, notify the DBpedia team about them.

  • try pre-processing your data to remove triples containing illegal URIs before they enter your processing chain.
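As a rough illustration of the pre-processing idea, the sketch below drops N-Triples lines whose predicate URI has no trailing part usable as an XML element name once percent-escapes are set aside. This is a simplified heuristic written for this example, not Jena's actual check: the class NTriplesPredicateFilter and its methods are hypothetical, the name-character test only approximates the XML rules, and lines are split naively on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-processing helper; a heuristic, not Jena's own validation.
class NTriplesPredicateFilter {

    // Rough approximation of the characters allowed inside an XML local name.
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // Rough approximation of the characters allowed to start an XML local name.
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    // True if the URI ends in a run of name characters that contains a valid
    // start character and is not merely the hex digits of a percent-escape.
    static boolean hasXmlLocalName(String uri) {
        int start = uri.length();
        while (start > 0 && isNameChar(uri.charAt(start - 1))) {
            start--;
        }
        // If the run directly follows '%', its first two characters belong to the escape.
        if (start > 0 && uri.charAt(start - 1) == '%') {
            start = Math.min(start + 2, uri.length());
        }
        for (int i = start; i < uri.length(); i++) {
            if (isNameStart(uri.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Keeps blank lines, comments, and anything this sketch cannot parse.
    static boolean keepLine(String line) {
        String trimmed = line.trim();
        String[] parts = trimmed.split("\\s+", 3);
        if (parts.length < 3 || trimmed.startsWith("#")) {
            return true;
        }
        String predicate = parts[1];
        if (!predicate.startsWith("<") || !predicate.endsWith(">")) {
            return true;
        }
        return hasXmlLocalName(predicate.substring(1, predicate.length() - 1));
    }

    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            if (keepLine(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            // Line taken from the question.
            "<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> \"\\u9AD4\\u91CD\"@zh .",
            // Made-up triple, included only to show a line that passes the filter.
            "<http://dbpedia.org/resource/Jacky_Cheung> <http://xmlns.com/foaf/0.1/name> \"Jacky Cheung\"@en .");
        for (String line : filter(lines)) {
            System.out.println(line);
        }
    }
}
```

Run on the two sample lines, only the made-up foaf:name triple is kept. Filtering like this discards data, so it is worth trying the Turtle output suggestion first and using a filter only if RDF/XML output is really required.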

As for URIs containing Chinese and Japanese characters, Jena conforms to the IRI spec, so as long as your URIs conform to that you should be OK.

Ian Dickinson