
I'm trying to clear a moderately large TDB graph, around 13 million triples, so I'm paginating the triple retrieval to avoid an OutOfMemoryError. Here's my code:

private void clearDataset() {
        
    int offset = 0;
    int page = 1;
    long totalTriples = getModelSize();
        
    while (offset < totalTriples) {
        StmtIterator modelPage = getModelPage();        
        dataset.begin(ReadWrite.WRITE);
            
        try {           
            m = dataset.getNamedModel(graph);
            m.remove(modelPage);
            dataset.commit();
            offset = PAGE_SIZE * page++;        
        }   
        finally { 
            dataset.end();      
        }   
        System.out.println("Remaining triples: " + getModelSize());
    }
}

private StmtIterator getModelPage() {
        
    Model submodel = ModelFactory.createDefaultModel();
    String query = "SELECT ?s ?p ?o WHERE {?s ?p ?o .} LIMIT " + PAGE_SIZE; 
    dataset.begin(ReadWrite.READ);
    
    try {   
        m = dataset.getNamedModel(graph);
        QueryExecution qe = QueryExecutionFactory.create(query, m);
        ResultSet resultSet = qe.execSelect();

        while (resultSet.hasNext()) {
            QuerySolution next = resultSet.next();
            Resource subject = next.getResource("?s");
            Property predicate = submodel.createProperty(next.get("?p").toString());
            RDFNode object = next.get("?o");
            submodel.add(subject, predicate, object);
        }
    }
    finally { 
        dataset.end(); 
    }
        
    return submodel.listStatements();
}

private long getModelSize() {
        
    long modelSize = 0;
    dataset.begin(ReadWrite.READ);
        
    try {   
        m = dataset.getNamedModel(graph);
        modelSize = m.size();
    }
    finally { 
        dataset.end(); 
    }   
    return modelSize;
}

It works fine for a small graph (under 1 million triples); there I also use a small PAGE_SIZE, so the deletion isn't done in a single pass and I can check that the pagination works:

Remaining triples: 613244
Remaining triples: 513244
Remaining triples: 413244
Remaining triples: 313244
Remaining triples: 213244
Remaining triples: 113244
Remaining triples: 13244
Remaining triples: 0
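The offset bookkeeping behind those counts can be checked in isolation. Here is a hypothetical stand-alone dry run (plain Java, no Jena) of the loop arithmetic from clearDataset(), assuming a starting size of 713,244 triples and PAGE_SIZE = 100000 to match the output above:

```java
// Hypothetical dry run of the pagination bookkeeping in clearDataset():
// each pass removes at most PAGE_SIZE triples, so the remaining count
// drops by PAGE_SIZE until the final partial page.
public class PaginationDryRun {
    static final long PAGE_SIZE = 100_000;   // assumed from the output above

    public static void main(String[] args) {
        long totalTriples = 713_244;         // assumed starting graph size
        long remaining = totalTriples;
        int offset = 0;
        int page = 1;
        while (offset < totalTriples) {
            remaining -= Math.min(PAGE_SIZE, remaining); // LIMIT caps the last page
            offset = (int) (PAGE_SIZE * page++);
            System.out.println("Remaining triples: " + remaining);
        }
    }
}
```

Run as shown, this prints the same eight descending counts as the small-graph test, which confirms the loop terminates exactly when the last partial page is removed.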

For the larger graph, I'm using PAGE_SIZE = 1000000. It runs fine for a few pages, then crashes:

Remaining triples: 12338413
Remaining triples: 11338413
Remaining triples: 10338413
Remaining triples: 9338413
Remaining triples: 8338413
org.apache.jena.sparql.JenaTransactionException: end() called for WRITE transaction without commit or abort having been called. This causes a forced abort.
        at org.apache.jena.tdb.transaction.Transaction.close(Transaction.java:363)
        at org.apache.jena.tdb.transaction.DatasetGraphTxn.end(DatasetGraphTxn.java:77)
        at org.apache.jena.tdb.transaction.DatasetGraphTransaction.end(DatasetGraphTransaction.java:223)
        at org.apache.jena.sparql.core.DatasetImpl.end(DatasetImpl.java:164)
        ...

Something is making it skip the commit() partway through. I wonder if there is still some memory issue in disguise. Could PAGE_SIZE have something to do with it, or have I missed something else?
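The JenaTransactionException above is a symptom rather than the cause: something thrown inside the try block prevents commit() from ever running, and end() in the finally then forces the abort. One way to surface the real exception (a sketch of my own, not a confirmed fix) is to catch everything inside the write transaction and abort explicitly:

```java
// Sketch: same write step, but surface whatever breaks before commit().
// On success, commit; on any failure, print the cause (quite possibly an
// OutOfMemoryError), abort, and rethrow, so end() in the finally never
// sees a write transaction that was neither committed nor aborted.
dataset.begin(ReadWrite.WRITE);
try {
    Model m = dataset.getNamedModel(graph);
    m.remove(modelPage);
    dataset.commit();
} catch (Throwable th) {
    th.printStackTrace();
    dataset.abort();
    throw th;
} finally {
    dataset.end();
}
```

With this in place the stack trace should show the original failure instead of the forced-abort message.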

vivss
  • Was there also an exception? Add `catch (Throwable th) { th.printStackTrace(); }` before the finally to find out why the commit was missed; it may be an OOME. Try `org.apache.jena.tdb.transaction.TransactionManager.QueueBatchSize = 0`. This is one of the design issues that is fixed by TDB2. – AndyS Jul 06 '23 at 21:39
  • `dataset.getGraph(....).remove(s, p, o);` slice-delete should do the paging for you. – AndyS Jul 06 '23 at 21:39
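The slice-delete suggested in the comment above could be sketched like this (my reading of the comment, untested; `Node.ANY` wildcards make `Graph.remove(s, p, o)` delete every matching triple, with TDB doing the batching internally instead of the manual pagination):

```java
// Sketch of the slice-delete approach from the comment above:
// Graph.remove(s, p, o) deletes all triples matching the pattern, so
// wildcards clear the whole named graph in one write transaction.
dataset.begin(ReadWrite.WRITE);
try {
    Node graphNode = NodeFactory.createURI(graph);
    dataset.asDatasetGraph().getGraph(graphNode)
           .remove(Node.ANY, Node.ANY, Node.ANY);
    dataset.commit();
} finally {
    dataset.end();
}
```

This replaces the entire clearDataset() loop, since no intermediate Model of statements needs to be materialized in memory.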

0 Answers