I am trying to generate some user statistics from a triple store using SPARQL. Please see the query below. How can this be improved? Am I doing something evil here? Why is this consuming so much memory? (see the background story at the end of this post)
I prefer to do the aggregation and the joins inside the triple store. Splitting the query up would mean joining the results "manually" outside the database, losing the efficiency and optimizations of the triple store. There is no need to reinvent the wheel for no good reason.
The query
SELECT
  ?person
  (COUNT(DISTINCT ?sent_email) AS ?sent_emails)
  (COUNT(DISTINCT ?received_email) AS ?received_emails)
  (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails)
  (COUNT(DISTINCT ?revision) AS ?commits)
WHERE {
  ?person rdf:type foaf:Person .
  OPTIONAL {
    ?sent_email rdf:type email:Email .
    ?sent_email email:sender ?person .
  }
  OPTIONAL {
    ?received_email rdf:type email:Email .
    ?received_email email:recipient ?person .
  }
  OPTIONAL {
    ?receivedInCC_email rdf:type email:Email .
    ?receivedInCC_email email:ccRecipient ?person .
  }
  OPTIONAL {
    ?revision rdf:type vcs:VcsRevision .
    ?revision vcs:committedBy ?person .
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)
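One rewrite I am considering, which would keep all joins inside the triple store while avoiding a cross product of the four OPTIONAL blocks, is to aggregate each relation in its own subquery and join the per-person counts afterwards. A sketch (same prefixes as above assumed; untested):

```sparql
SELECT ?person ?sent_emails ?received_emails ?receivedInCC_emails ?commits
WHERE {
  ?person rdf:type foaf:Person .
  OPTIONAL {
    SELECT ?person (COUNT(?e) AS ?sent_emails) WHERE {
      ?e rdf:type email:Email ; email:sender ?person .
    } GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(?e) AS ?received_emails) WHERE {
      ?e rdf:type email:Email ; email:recipient ?person .
    } GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(?e) AS ?receivedInCC_emails) WHERE {
      ?e rdf:type email:Email ; email:ccRecipient ?person .
    } GROUP BY ?person
  }
  OPTIONAL {
    SELECT ?person (COUNT(?r) AS ?commits) WHERE {
      ?r rdf:type vcs:VcsRevision ; vcs:committedBy ?person .
    } GROUP BY ?person
  }
}
ORDER BY DESC(?commits)
```

Each subquery is aggregated independently, so its intermediate result is at most one row per person and the DISTINCT modifiers become unnecessary. Is this a reasonable way to do it, or is there a better pattern?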
Background
The problem is that the query fails with the error "QUERY MEMORY LIMIT REACHED" in AllegroGraph (please also see my related SO question). The repository contains only around 200k triples, which easily fit into an N-Triples input file of ca. 60 MB, so I wonder how executing the query can require more than 4 GB of RAM, roughly two orders of magnitude more than the raw data. My only guess is that the four OPTIONAL blocks multiply out into a cross product per person before the DISTINCT counts are applied: a person with 100 matches in each block would already produce 100^4 = 10^8 intermediate rows.
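To see which of the four relations dominates the join, the inputs can be sized individually. A diagnostic sketch (same prefixes assumed; untested):

```sparql
# Count the triples feeding each OPTIONAL block of the main query.
SELECT ?p (COUNT(*) AS ?triples)
WHERE {
  ?s ?p ?o .
  FILTER(?p IN (email:sender, email:recipient,
                email:ccRecipient, vcs:committedBy))
}
GROUP BY ?p
```

If one predicate is much larger than the others, that block is the most likely source of the memory blow-up.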