0

I am writing a code, that gathers some statistics about ontologies. as input I have a folder with files some are RDF/XML, some are turtle or nt. My problem is, that when I try to parse a file using wrong format, next time even if I parse it with correct format it fails. Here test file is turtle format. If first parse it with turtle format all is fine. but if I first parse it with the wrong format 1. error is understandable (file:///test:1:0: not well-formed (invalid token)), but error for second is (Unknown namespace prefix : owl). Like I said when I first parse with the correct one, I don't get namespace error.

Pleas help, after 2 days, I'm getting desperate.

query = 'SELECT DISTINCT ?s ?o WHERE {  ?s ?p owl:Ontology .   ?s  rdfs:comment  ?o}'
data = open("test", "r")
g = rdflib.Graph("IOMemory")

try:
    result = g.parse(file=data,format="xml")
    relations = g.query(query)
    print(( " graph has %s statements." % len(g)))
except:
    print "bad1"
    e = sys.exc_info()[1]
    print e

try:
    result = g.parse(file=data,format="turtle")
    relations = g.query(query)
    print(( " graph has %s statements." % len(g)))
except :
    print "bad2"
    e = sys.exc_info()[1]
    print e
Andrejs
  • 130
  • 5

1 Answers1

0

The problem is that the g.parse reads some part from the file input stream of data first, only to figure out afterwards that it is not xml. The second call (with the turtle format) then continues to read from the input stream after the part where the previous attempt has stopped. The part read by the first parser is lost to the secnd one.

If your test file is small, the xml-parser might have read it all, leaving an "empty" rest. It seems the turtle parser did not complain - it just read in nothing. Only the query in the next statement failed to find anything owl-like in it, as the graph is empty. (I have to admit I cannot reproduce this part, the turtle parser does complain in my case, but maybe I have a different version of rdflib)

To fix it, try to reopen the file; either reorganize the code so you have an data = open("test", "r") every time you call result = g.parse(file=data, format="(some format)"), or call data.seek(0) in the except: clause, like:

for format in 'xml','turtle':
  try:
    print 'reading', format
    result = g.parse(data, format=format)
    print 'success'
    break
  except Exception:
    print 'failed'
    data.seek(0)
Clemens Klein-Robbenhaar
  • 3,457
  • 1
  • 18
  • 27
  • Thanks, this worked great! Actually for me was enough to add `data.seek(0)` If someone wants to reproduce my error, I used this as test file [link](http://www.w3.org/2000/01/rdf-schema). And my RDFlib version is 4.2.0. – Andrejs Mar 02 '15 at 14:57