I am trying to understand the logic of set operations (Union, Addition, Intersection, Difference, Xor) in RDFLib, and have run some tests with identical files whose results didn't match my naive expectations. I therefore tested the "in" operator in two ways:

I looped over all triples in graph A and checked whether each one exists in graph B, after initializing A from a tiny RDF/Turtle test file and initializing B either:

  1. by setting B=A
    from rdflib import Graph

    A = Graph()
    A.parse("A.ttl", format='turtle')
    B = A  # B is the same Graph object as A, not a copy
    
    for t in A.triples((None, None, None)):
        if t in B:
            print(f"found {t} in B")
        else:
            print(f"didn't find {t} in B")
  2. by loading it from the same file
    from rdflib import Graph

    A = Graph()
    A.parse("A.ttl", format='turtle')
    B = Graph()  # a separate Graph, filled by a second parse of the same file
    B.parse("A.ttl", format='turtle')
    
    for t in A.triples((None, None, None)):
        if t in B:
            print(f"found {t} in B")
        else:
            print(f"didn't find {t} in B")

In case 1), all triples in A were also found in B, as expected. In case 2), only some of the triples in A were also found in B (those without BNodes).

Is there any way to avoid the behaviour of case 2), or did I misunderstand something very basic? (I'm an RDF newbie, but otherwise not afraid of graphs.)

Cheers, Joel

Joel Thill
  • It's basically a matter of the parser. Blank node IDs only have to be stable within one RDF document, i.e. across all the triples of a single parse. When you parse the same RDF document twice, there is no guarantee that the IDs are the same as before; the parser only has to ensure that each bnode is the same across the triples of that parse. Most implementations use a random number generator, a UUID, etc. – UninformedUser Jun 15 '21 at 17:12
  • Some implementations allow reusing exactly the same identifier in their internal data structures, but that is implementation-specific. And indeed, this doesn't work for anonymous bnodes, as in the Turtle syntax `:s :p [:q :o] .`; there you also need a stable counter, which is trivial for sequential parsing. But it has to be used first. For `rdflib` I'm not sure; probably a dev will give an answer here soon – UninformedUser Jun 15 '21 at 17:17

1 Answer

Blank nodes have no identity outside a graph. If you process the same file with blank nodes twice, you should expect that blank nodes get different internal identifiers.

For reference, section 3.5 of the RDF 1.1 Concepts and Abstract Syntax explains:

Blank nodes do not have identifiers in the RDF abstract syntax. The blank node identifiers introduced by some concrete syntaxes have only local scope and are purely an artifact of the serialization.

In situations where stronger identification is needed, systems MAY systematically replace some or all of the blank nodes in an RDF graph with IRIs. Systems wishing to do this SHOULD mint a new, globally unique IRI (a Skolem IRI) for each blank node so replaced.

So, to work around this, you can give blank nodes an IRI that persists beyond working with the in-memory graph. The referenced section has guidance on how to mint such IRIs.

Ben Companjen