
A colleague and I are each instantiating electronic health records as RDF triples. We'd like to compare our sets of 10k to 100k triples to see whether they have the same shapes.

As a policy, I create URIs based on UUIDs, so nothing semantic is embedded in them. I'd like to stick with this policy, as my colleague and I are trying to compare our existing workflows as a whole.

I know how to compare two RDF files in TopBraid Composer, but I don't think it will be useful if we have the same data patterns but different URIs. I store my triples in Ontotext GraphDB but am glad to use any other tool.

For example, the triples about person ...fe54977c174a and person ...4bcdc1c8abf9 should be considered equivalent, but ...fe54977c174a and ...ae00dc86b3bb should not. Is this feasible?

I would prefer not to spot-check with hand-crafted SPARQL ASK statements.

@prefix ns0: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/4f79ea05-2358-4f43-a335-fe54977c174a>
  a <http://example.com/Person> ;
  ns0:gender ns0:Male ;
  ns0:participatesIn ns0:5d2dfc7b-994c-4933-b787-f7971dae397c .

ns0:5d2dfc7b-994c-4933-b787-f7971dae397c
  a ns0:HealthCareEncounter ;
  ns0:startDate "2019-05-01"^^xsd:date ;
  ns0:hasOutput ns0:a129ca96-c6d2-4a07-a4eb-4cf9ce23a314 .

ns0:a129ca96-c6d2-4a07-a4eb-4cf9ce23a314
  a ns0:Diagnosis ;
  ns0:mentions ns0:Headache .

has the same shape as this (despite the different URIs):

@prefix ns0: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/a740d254-084c-4621-b06d-4bcdc1c8abf9>
  a <http://example.com/Person> ;
  ns0:gender ns0:Male ;
  ns0:participatesIn ns0:060d2091-b4f7-406d-ab0d-75b39b400823 .

ns0:060d2091-b4f7-406d-ab0d-75b39b400823
  a ns0:HealthCareEncounter ;
  ns0:startDate "2019-05-01"^^xsd:date ;
  ns0:hasOutput ns0:bc549711-ed9d-4db6-8cf9-d43022903ef7 .

ns0:bc549711-ed9d-4db6-8cf9-d43022903ef7
  a ns0:Diagnosis ;
  ns0:mentions ns0:Headache .

but this is structurally different (due to the different gender and diagnosis mention):

@prefix ns0: <http://example.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/aa3a977a-999a-4c5c-9524-ae00dc86b3bb>
  a <http://example.com/Person> ;
  ns0:gender ns0:Female ;
  ns0:participatesIn ns0:b31a62a5-337a-454d-a637-85aefef26684 .

ns0:b31a62a5-337a-454d-a637-85aefef26684
  a ns0:HealthCareEncounter ;
  ns0:startDate "2019-05-01"^^xsd:date ;
  ns0:hasOutput ns0:6566d543-773e-4649-b589-66eb3d0f3165 .

ns0:6566d543-773e-4649-b589-66eb3d0f3165
  a ns0:Diagnosis ;
  ns0:mentions ns0:Nausea .

Mark Miller
  • Is this just an example here or do you have all the templates in advance? I don't think so, right? Because otherwise, you could just create the SPARQL queries. – UninformedUser May 30 '19 at 16:51
  • Just looking at your example here, the first thing that comes to mind is a (sub)graph isomorphism problem: subgraph clustering or, even better here, tree clustering. Actually, I don't think it's that complicated, but I'm not aware of a standard out-of-the-box tool for this. How much time do you have to solve this? – UninformedUser May 30 '19 at 16:58
  • Theoretically there should be a small number of patterns that we could template, since the triples were created by an R2RML process. There will also be data properties with literal values that need to be compared. Furthermore, I would like to make as few assumptions as possible and make this an empirical comparison. If you think more discussion and brainstorming is justified, then I wouldn't need to rush. I appreciate your input. – Mark Miller May 30 '19 at 17:09
  • Apache Jena contains some code that does isomorphism checking. It is designed for blank node to blank node isomorphism but can be adapted. See `IsoMatcher`. Or rewrite the IRIs to blank nodes and use the `Graph.isIsomorphicWith` operation (which only maps blank nodes). – AndyS May 30 '19 at 18:11
  • Yeah, I have used these isomorphism methods from Jena before. As long as you don't have to compare too many individuals, it would be the easiest way, with a few lines of code in Java. But keep in mind that n² comparisons can take some time, and you also have to fetch the data for each individual first. Given that you said the total number of triples is just 10k up to 100k, it shouldn't take hours though (I guess). For larger-scale clustering, I'd indeed go with a different approach. – UninformedUser May 31 '19 at 05:36
  • Yeah, I have used these isomorphism methods from Jena before. As long as you don't have to compare too many individuals, it would be the easiest way, with a few lines of code in Java. But keep in mind that `O(N²)` comparisons can take some time, and you also have to fetch the data for each individual first. In practice it should be a smaller number of comparisons, as you only have to compare against a single individual in each existing cluster. – UninformedUser May 31 '19 at 07:55
  • Given that you said the total number of triples is just 10k up to 100k, it shouldn't take hours though (I guess). For larger-scale clustering, I'd indeed go with a different approach. @MarkMiller what are your thoughts? Do you need some help with setting up the isomorphism stuff? Or do you think it won't scale for your dataset? – UninformedUser May 31 '19 at 07:55
  • @AKSW My colleague and I have decided to embed identifying information in the URIs for at least the next week or two, so I don't need any help right now. However, I haven't tested RDF triples for isomorphism before, and the triples I have aren't already in the preferred blank node style, so I may want to talk more in the future. – Mark Miller May 31 '19 at 18:25
  • @AKSW are you going to the International Conference on Biomedical Ontologies in Buffalo, NY this summer? – Mark Miller May 31 '19 at 18:26
  • @AKSW if you're willing to talk outside of Stack Overflow, I can put my email address in this chat for a short period of time. – Mark Miller May 31 '19 at 18:28

1 Answer


Eclipse RDF4J (bundled with GraphDB) contains a graph isomorphism utility: `Models.isomorphic`. By default it only maps blank nodes to blank nodes, so IRIs must match exactly. That leaves you two options:

  1. Replace each instance IRI in your graphs with a (dictionary-mapped) blank node, so that the same IRI always maps to the same blank node. This should be fairly easy to do with a HashMap and a bit of looping or streaming magic.
  2. Have a look at the code for the Models utility and adapt the part that does blank node mapping so that it maps IRIs instead.
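
As a rough illustration of the idea behind option 1, here is a minimal, dependency-free Python sketch. It does not use RDF4J or `Models.isomorphic`. Instead it swaps in a simpler technique: each UUID-based IRI is treated as anonymous, and every person is reduced to a canonical "shape" signature built from its outgoing edges, so individuals with equal signatures can be grouped without pairwise comparison. The names `is_anonymous`, `shape`, `group_by_shape` and `record` are hypothetical helpers, triples are plain string tuples rather than parsed RDF, and the sketch assumes that instance IRIs end in a UUID while vocabulary terms do not, and that the data is tree-shaped (cycles are only cut off, not compared properly).

```python
import re
from collections import defaultdict

# Assumption: instance IRIs end in a UUID; classes, properties and values do not.
UUID_IRI = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def is_anonymous(term):
    """True for terms whose identity should be ignored (UUID-based IRIs)."""
    return bool(UUID_IRI.search(term))


def shape(node, outgoing, path=()):
    """Hashable description of everything reachable from node, UUID IRIs erased."""
    if node in path:
        return "<cycle>"  # crude guard; real cyclic data needs true isomorphism
    edges = []
    for pred, obj in outgoing.get(node, ()):
        if is_anonymous(obj):
            # Describe the anonymous object by its own shape, not its IRI.
            edges.append((pred, shape(obj, outgoing, path + (node,))))
        else:
            edges.append((pred, obj))
    return tuple(sorted(edges, key=repr))  # order-independent signature


def group_by_shape(triples, root_class):
    """Group all instances of root_class whose shapes are identical."""
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))
    groups = defaultdict(list)
    for s, edges in outgoing.items():
        if ("rdf:type", root_class) in edges:
            groups[shape(s, outgoing)].append(s)
    return list(groups.values())


def record(person, encounter, diagnosis, gender, finding):
    """Build the triples for one patient record from the question's pattern."""
    return [
        (person, "rdf:type", "ns0:Person"),
        (person, "ns0:gender", gender),
        (person, "ns0:participatesIn", encounter),
        (encounter, "rdf:type", "ns0:HealthCareEncounter"),
        (encounter, "ns0:startDate", '"2019-05-01"^^xsd:date'),
        (encounter, "ns0:hasOutput", diagnosis),
        (diagnosis, "rdf:type", "ns0:Diagnosis"),
        (diagnosis, "ns0:mentions", finding),
    ]


triples = (
    record("ns0:4f79ea05-2358-4f43-a335-fe54977c174a",
           "ns0:5d2dfc7b-994c-4933-b787-f7971dae397c",
           "ns0:a129ca96-c6d2-4a07-a4eb-4cf9ce23a314",
           "ns0:Male", "ns0:Headache")
    + record("ns0:a740d254-084c-4621-b06d-4bcdc1c8abf9",
             "ns0:060d2091-b4f7-406d-ab0d-75b39b400823",
             "ns0:bc549711-ed9d-4db6-8cf9-d43022903ef7",
             "ns0:Male", "ns0:Headache")
    + record("ns0:aa3a977a-999a-4c5c-9524-ae00dc86b3bb",
             "ns0:b31a62a5-337a-454d-a637-85aefef26684",
             "ns0:6566d543-773e-4649-b589-66eb3d0f3165",
             "ns0:Female", "ns0:Nausea")
)

groups = group_by_shape(triples, "ns0:Person")
for members in groups:
    print(len(members), members)
```

With the three records from the question, the two male/headache persons land in one group and the female/nausea person in another. Because each individual is hashed once, this avoids the pairwise comparisons discussed in the comments, at the cost of handling only tree-like data; for graphs with cycles or shared nodes, a real isomorphism check over blank nodes remains the safer route.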
Jeen Broekstra