
I am struggling to load most of the Drug Ontology OWL files and most of the ChEBI OWL files into a GraphDB Free v8.3 repository with Optimized OWL Horst reasoning turned on.

Is this possible? Should I do something other than "be patient"?

Details:

I'm using the loadrdf offline bulk loader to populate an AWS r4.16xlarge instance with 488.0 GiB of RAM and 64 vCPUs.

Over the weekend, I played around with different pool buffer sizes and found that most of these files individually load fastest with a pool buffer of 2,000 or 20,000 statements instead of the suggested 200,000. I also added -Xmx470g to the loadrdf script. Most of the OWL files would load individually in less than one hour.
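For reference, here is a rough sketch of the bulk-load invocation I'm using. The repository config file name is illustrative; -c and -m are the documented loadrdf options, and the pool buffer size is the pool.buffer.size tuning property from the loadrdf documentation, which I'm assuming can be passed as a Java system property (I actually set -Xmx470g by editing the loadrdf script directly).

# Sketch only: repo-config.ttl is an illustrative file name.
# Assumes the loadrdf script picks up GDB_JAVA_OPTS; otherwise edit the script.
export GDB_JAVA_OPTS="-Xmx470g -Dpool.buffer.size=20000"
./bin/loadrdf -c repo-config.ttl -m parallel \
    chebi.owl chebi-disjoints.owl chebi-proteins.owl \
    dron-chebi.owl dron-full.owl dron-hand.owl dron-ingredient.owl \
    dron-lite.owl dron-ndc.owl dron-pro.owl dron-rxnorm.owl dron-upper.owl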

Around 10 pm EDT last night, I started loading all of the files listed below simultaneously. Now, 11 hours later, there are still millions of statements to go. The load rate is around 70 statements/second now. It appears that only 30% of my RAM is being used, but the CPU load is consistently around 60.

  • Are there websites that document other people doing something of this scale?
  • Should I be using a different reasoning configuration? I chose this configuration because it was the fastest-loading OWL configuration in my experiments over the weekend. I think I will need to look for relationships that go beyond rdfs:subClassOf (see the query sketch below for the kind of thing I mean).
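To illustrate why I think inference would help (the ChEBI class here is only an example): without inference I have to traverse the class hierarchy explicitly with a property path, whereas against a repository with materialized RDFS/OWL Horst inference the same question becomes a single triple pattern.

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix obo:  <http://purl.obolibrary.org/obo/>

# No inference: walk the subclass hierarchy explicitly.
# obo:CHEBI_23888 ("drug") is only an illustrative starting class.
SELECT ?cls WHERE { ?cls rdfs:subClassOf* obo:CHEBI_23888 }

# With materialized inference, the transitive closure is already stored:
# SELECT ?cls WHERE { ?cls rdfs:subClassOf obo:CHEBI_23888 }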

Files I'm trying to load:

+-------------+------------+---------------------+
|    bytes    | statements |        file         |
+-------------+------------+---------------------+
| 471,265,716 | 4,268,532  | chebi.owl           |
| 61,529      | 451        | chebi-disjoints.owl |
| 82,449      | 1,076      | chebi-proteins.owl  |
| 10,237,338  | 135,369    | dron-chebi.owl      |
| 2,374       | 16         | dron-full.owl       |
| 170,896     | 2,257      | dron-hand.owl       |
| 140,434,070 | 1,986,609  | dron-ingredient.owl |
| 2,391       | 16         | dron-lite.owl       |
| 234,853,064 | 2,495,144  | dron-ndc.owl        |
| 4,970       | 28         | dron-pro.owl        |
| 37,198,480  | 301,031    | dron-rxnorm.owl     |
| 137,507     | 1,228      | dron-upper.owl      |
+-------------+------------+---------------------+
Mark Miller
  • Is the materialization done during loading of the files, or is it materialized after all triples have been loaded? Depending on the expressivity you need, less complex reasoning can indeed significantly increase performance. OWL Horst is much more complex compared to, e.g., RDFS, where you can use a fixed order for the rules that have to be applied to the RDF data. I'm aware of some benchmarks that have been used for distributed reasoning, but I don't think I can estimate how long it would take on your data. – UninformedUser Oct 23 '17 at 18:54
  • @AKSW I believe materialization is done during the load itself. I have several colleagues who have, like you, suggested going to a less expensive reasoning. I'm starting to write some SPARQL queries against these ontologies in a no-inference repository and they're really long. I was hoping that a more complex rule-set would allow me to write shorter, less explicit queries, but maybe that's naive on my part. I'll post an example soon. – Mark Miller Oct 23 '17 at 18:58
  • @MarkMiller, have you tried to load these triples into a GraphDB repository with the "No inference" ruleset? I know you need reasoning, but I suspect that the results will be approximately the same... Please test, if it is not very time- or cost-expensive! – Stanislav Kralin Oct 23 '17 at 19:11
  • @StanislavKralin It only took 200 seconds to load the same data into an RDFS+ "optimized" repo, using an r4.4xlarge server (122.0 GiB RAM, 16 vCPUs), with the statement pool set to 20,000. I haven't tried with inference completely disabled yet. – Mark Miller Oct 23 '17 at 20:10
  • @MarkMiller Do you know which kind of rules you would need? – UninformedUser Oct 23 '17 at 20:14
  • @AKSW there are several kinds of data items in these ontologies that are important to my team, and the paths from one to the others can be pretty indirect. I posted some thoughts about why I *might* need RDFS+ or OWL reasoning at: https://stackoverflow.com/questions/46916049/do-i-really-need-owl-reasoning – Mark Miller Oct 24 '17 at 16:57

2 Answers


@MarkMiller you can take a look at the Preload tool, which is part of the GraphDB 8.4.0 release. It's specifically designed to handle large amounts of data at a constant speed. Note that it works without inference, so you'll need to load your data and then change the ruleset and re-infer the statements.

http://graphdb.ontotext.com/documentation/free/loading-data-using-preload.html
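A minimal sketch of a preload invocation, following the documentation page above; the repository config file and data paths are illustrative, and the exact option names should be checked against the linked page for your GraphDB version.

# Sketch only: repo-config.ttl and the data path are illustrative.
./bin/preload -c repo-config.ttl /data/ontologies/*.owl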

Konstantin Petrov
  • Thanks. Haven't tried preloading yet, but did try changing the reasoning level @ http://graphdb.ontotext.com/documentation/standard/configuring-a-repository.html#reconfigure-a-repository. From the system repo, I ran a query about the ruleset in my target repo before and after the modification. It was "empty" before the modification and "rdfs-plus-optimized" after. Then I did reinferring @ http://graphdb.ontotext.com/documentation/standard/reasoning.html#reinferring. **Now the reasoning level is blank/greyed out in the web interface and SPARQL queries show no new inferences.** Suggestions? – Mark Miller Jan 12 '18 at 14:44
  • @MarkMiller I suspect you misspelled the ruleset if it's the same as in your comment. The steps to change the ruleset are: 1. Add the ruleset: PREFIX sys: <http://www.ontotext.com/owlim/system#> INSERT DATA { _:b sys:addRuleset "rdfsplus-optimized" } 2. Set the ruleset as default: PREFIX sys: <http://www.ontotext.com/owlim/system#> INSERT DATA { _:b sys:defaultRuleset "rdfsplus-optimized" } 3. Re-infer: PREFIX sys: <http://www.ontotext.com/owlim/system#> INSERT DATA { [] sys:reinfer [] } 4. Note that the UI will show the old ruleset. This is OK. – Konstantin Petrov Jan 15 '18 at 08:28

Just typing out @Konstantin Petrov's correct suggestion with tidier formatting. All of these queries should be run in the repository of interest... at some point in working this out, I misled myself into thinking that I should be connected to the SYSTEM repo when running these queries.

All of these queries also require the following prefix definition:

prefix sys: <http://www.ontotext.com/owlim/system#>

This doesn't directly address the timing/performance of loading large datasets into an OWL reasoning repository, but it does show how to switch to a higher level of reasoning after loading lots of triples into a no-inference ("empty" ruleset) repository.

You could start by querying for the current reasoning level/ruleset, and then run this same SELECT statement after each INSERT.

SELECT ?state ?ruleset { ?state sys:listRulesets ?ruleset }

Add a predefined ruleset

INSERT DATA { _:b sys:addRuleset "rdfsplus-optimized" }

Make the new ruleset the default

INSERT DATA { _:b sys:defaultRuleset "rdfsplus-optimized" }

Re-infer... could take a long time!

INSERT DATA { [] <http://www.ontotext.com/owlim/system#reinfer> [] }
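As a usage note, these updates can also be sent over HTTP to the repository's SPARQL update endpoint (GraphDB exposes the standard RDF4J-style /repositories/<id>/statements endpoint). The host, port, and repository id below are illustrative.

# Illustrative only: host, port, and repository id are assumptions.
curl -X POST http://localhost:7200/repositories/my-repo/statements \
     --data-urlencode 'update=prefix sys: <http://www.ontotext.com/owlim/system#>
INSERT DATA { _:b sys:addRuleset "rdfsplus-optimized" }'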

Mark Miller