
I've got an RDF4J disk-based Native Store with roughly 9M triples. I'm trying to improve the performance of deleting about 4K triples, which currently takes ~20 seconds. I've tried:

1

    Repository rep = new SailRepository(new NativeStore(new File(DATA_DIR + "/db"), "spoc, posc, opsc"));
    rep.initialize();
    RepositoryConnection conn = rep.getConnection();
    conn.remove(statements); // first find statements, then pass them into remove method

2

    # Executed with conn.prepareUpdate(QueryLanguage.SPARQL, query)
    DELETE DATA 
    {
      <#book2> <http://purl.org/dc/elements/1.1/title>   "David Copperfield" ; 
             <http://purl.org/dc/elements/1.1/creator> "Edmund Wells"      .
      # all triples listed explicitly here
    }

3

    # Executed with conn.prepareUpdate(QueryLanguage.SPARQL, query)
    DELETE { ?person ?property ?value } 
    WHERE 
      { ?person ?property ?value ; <http://xmlns.com/foaf/0.1/givenName> "Fred" }
      # query pattern

All three methods show similar timings. I believe there must be a quicker way to remove 4K triples. Please let me know if you have any idea what I'm doing wrong. I'll be glad to provide additional details.

  • A blind guess: in the first code excerpt, try wrapping the remove statement [in a transaction](https://docs.rdf4j.org/programming/#_transactions) by adding `conn.begin();` and `conn.commit();`. (That probably won't help, but it's worth a try.) – cygri May 22 '19 at 19:39
  • @cygri yeah, sorry I've skipped the details. I've tried this and even set IsolationLevels.NONE - always the same result. – Alex Boyarintsev May 22 '19 at 21:56
  • You could wait here for @JeenBroekstra, or directly open a GitHub ticket. At least, I can't see what you could change for such simple operations on a tiny dataset. I also couldn't find any settings for the Native Store w.r.t. SPARQL update operations. There was only one open performance ticket that was somehow related, but not directly, because it's about the `CLEAR` operation: https://github.com/eclipse/rdf4j/issues/545 – UninformedUser May 23 '19 at 04:45
  • Are `INSERT` performance and query performance also affected? Which RDF4J version do you use? And which hardware? – UninformedUser May 23 '19 at 04:46
  • @AKSW Thanks! INSERT (`conn.add(statements)`) of the same triples I'm removing is almost instant, ~30 ms. I was using Sesame 2.7.14 and have now switched to RDF4J 2.5.1; the result is the same. I'm working on a 6th-gen Core i7 with an NVMe SSD; the production server runs on a 32-core AWS r4.8xlarge. Again, the timings are quite similar and concurrency doesn't seem to help. – Alex Boyarintsev May 23 '19 at 06:36
  • Are you by any chance using inference or SPIN rules? They tend to make removals very slow in RDF4J. – kidney May 23 '19 at 10:28
  • @kidney no, the simplest config – Alex Boyarintsev May 23 '19 at 10:45
  • FWIW, we've managed to reproduce this, and it looks like a bug in how transaction buffer handling was introduced some time ago. We're looking into a fix. See https://github.com/eclipse/rdf4j/issues/1425 – Jeen Broekstra May 24 '19 at 05:25
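For reference, the explicit transaction wrapping suggested in the comments looks like this; a minimal sketch, assuming the `DATA_DIR` constant and `statements` collection from the question are in scope:

```java
import java.io.File;

import org.eclipse.rdf4j.IsolationLevels;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.nativerdf.NativeStore;

// Sketch: wrap the bulk removal in one explicit transaction so all 4K
// statements are flushed in a single commit instead of being auto-committed.
Repository rep = new SailRepository(
        new NativeStore(new File(DATA_DIR + "/db"), "spoc,posc,opsc"));
rep.initialize();
try (RepositoryConnection conn = rep.getConnection()) {
    conn.begin(IsolationLevels.NONE); // weakest isolation, as the asker also tried
    conn.remove(statements);
    conn.commit();
}
```

As the comment thread shows, this did not help here (the slowdown turned out to be a bug in the transaction buffer handling itself), but it is the normal first step for bulk updates against a Native Store.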

1 Answer


This turned out to be caused by a bug in RDF4J (see https://github.com/eclipse/rdf4j/issues/1425). It has been fixed in release 2.5.2.
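Picking up the fix is just a version bump; a sketch for Maven users, assuming the `rdf4j-runtime` aggregate artifact is the dependency in use:

```xml
<!-- Bump to 2.5.2, the release containing the fix for issue #1425 -->
<dependency>
  <groupId>org.eclipse.rdf4j</groupId>
  <artifactId>rdf4j-runtime</artifactId>
  <version>2.5.2</version>
</dependency>
```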

Jeen Broekstra