I'm trying to clear a moderately large TDB graph, around 13 million triples, so I'm paginating the triples retrieval to avoid OutOfMemory issues. Here's my code:
    private void clearDataset() {
        int offset = 0;
        int page = 1;
        long totalTriples = getModelSize();
        while (offset < totalTriples) {
            StmtIterator modelPage = getModelPage();
            dataset.begin(ReadWrite.WRITE);
            try {
                m = dataset.getNamedModel(graph);
                m.remove(modelPage);
                dataset.commit();
                offset = PAGE_SIZE * page++;
            }
            finally {
                dataset.end();
            }
            System.out.println("Remaining triples: " + getModelSize());
        }
    }
    private StmtIterator getModelPage() {
        Model submodel = ModelFactory.createDefaultModel();
        String query = "SELECT ?s ?p ?o WHERE {?s ?p ?o .} LIMIT " + PAGE_SIZE;
        dataset.begin(ReadWrite.READ);
        try {
            m = dataset.getNamedModel(graph);
            QueryExecution qe = QueryExecutionFactory.create(query, m);
            ResultSet resultSet = qe.execSelect();
            while (resultSet.hasNext()) {
                QuerySolution next = resultSet.next();
                Resource subject = next.getResource("?s");
                Property predicate = submodel.createProperty(next.get("?p").toString());
                RDFNode object = next.get("?o");
                submodel.add(subject, predicate, object);
            }
        }
        finally {
            dataset.end();
        }
        return submodel.listStatements();
    }
    private long getModelSize() {
        long modelSize = 0;
        dataset.begin(ReadWrite.READ);
        try {
            m = dataset.getNamedModel(graph);
            modelSize = m.size();
        }
        finally {
            dataset.end();
        }
        return modelSize;
    }
It works fine for a small graph of fewer than 1 million triples. There I also use a small PAGE_SIZE, so that the deletion isn't done in a single pass and I can check the pagination:
Remaining triples: 613244
Remaining triples: 513244
Remaining triples: 413244
Remaining triples: 313244
Remaining triples: 213244
Remaining triples: 113244
Remaining triples: 13244
Remaining triples: 0
For the larger graph, I'm using PAGE_SIZE = 1000000. It runs fine for a few pages, then crashes:
Remaining triples: 12338413
Remaining triples: 11338413
Remaining triples: 10338413
Remaining triples: 9338413
Remaining triples: 8338413
org.apache.jena.sparql.JenaTransactionException: end() called for WRITE transaction without commit or abort having been called. This causes a forced abort.
at org.apache.jena.tdb.transaction.Transaction.close(Transaction.java:363)
at org.apache.jena.tdb.transaction.DatasetGraphTxn.end(DatasetGraphTxn.java:77)
at org.apache.jena.tdb.transaction.DatasetGraphTransaction.end(DatasetGraphTransaction.java:223)
at org.apache.jena.sparql.core.DatasetImpl.end(DatasetImpl.java:164)
...
Something is making it skip the commit() halfway through. I wonder if there is still a memory issue in disguise. Could the PAGE_SIZE have something to do with it? Or have I missed something else?
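One thing I suspect is that the stack trace is hiding the real error: in Java, an exception thrown from a finally block replaces whatever was thrown in the try block, so a failure inside m.remove() could be masked by the exception from end(). Below is a minimal sketch of that effect. FakeTxn and failingRemove() are hypothetical stand-ins I made up for illustration; they are not Jena's API and only mimic the "end() without commit or abort" behaviour seen in the trace.

```java
public class MaskedExceptionDemo {

    /** Hypothetical stand-in for a transactional dataset; NOT Jena's API. */
    static class FakeTxn {
        private boolean finished = false;

        void commit() { finished = true; }
        void abort()  { finished = true; }

        // Mimics the behaviour in the stack trace: complains if the
        // transaction was neither committed nor aborted.
        void end() {
            if (!finished) {
                throw new IllegalStateException("end() called without commit or abort");
            }
        }
    }

    /** Simulates a failure (for example, running out of memory) before commit(). */
    static void failingRemove() {
        throw new RuntimeException("simulated failure during remove");
    }

    /** Same shape as clearDataset(): the exception from end() hides the original one. */
    static String withoutAbort() {
        FakeTxn txn = new FakeTxn();
        try {
            try {
                failingRemove();
                txn.commit();   // never reached
            } finally {
                txn.end();      // throws and replaces the exception from failingRemove()
            }
        } catch (RuntimeException e) {
            return e.getMessage();
        }
        return "no error";
    }

    /** Aborting before rethrowing lets the original failure surface. */
    static String withAbort() {
        FakeTxn txn = new FakeTxn();
        try {
            try {
                failingRemove();
                txn.commit();   // never reached
            } catch (RuntimeException | Error e) {
                txn.abort();    // mark the transaction finished so end() stays quiet
                throw e;        // rethrow the real cause
            } finally {
                txn.end();
            }
        } catch (RuntimeException e) {
            return e.getMessage();
        }
        return "no error";
    }

    public static void main(String[] args) {
        System.out.println("without abort: " + withoutAbort());
        System.out.println("with abort:    " + withAbort());
    }
}
```

If the same thing is happening in clearDataset(), adding a catch that calls dataset.abort() and rethrows before the finally block should reveal what actually interrupts the commit, whether that is an OutOfMemoryError or something else.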