
I am trying to generate some user statistics from a triple store using SPARQL. Please see the query below. How can this be improved? Am I doing something evil here? Why is this consuming so much memory? (see the background story at the end of this post)

I prefer to do the aggregation and the joins entirely inside the triple store. Splitting up the query would mean that I would have to join the results "manually", outside the database, losing the efficiency and optimizations of the triple store. No need to reinvent the wheel for no good reason.

The query

SELECT
    ?person
    (COUNT(DISTINCT ?sent_email) AS ?sent_emails)
    (COUNT(DISTINCT ?received_email) AS ?received_emails)
    (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails)
    (COUNT(DISTINCT ?revision) AS ?commits)

WHERE {
  ?person rdf:type foaf:Person.

  OPTIONAL {
    ?sent_email rdf:type email:Email.
    ?sent_email email:sender ?person.
  }

  OPTIONAL {
    ?received_email rdf:type email:Email.
    ?received_email email:recipient ?person.
  }

  OPTIONAL {
    ?receivedInCC_email rdf:type email:Email.
    ?receivedInCC_email email:ccRecipient ?person.
  }

  OPTIONAL {
    ?revision rdf:type vcs:VcsRevision.
    ?revision vcs:committedBy ?person.
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)

Background

The problem is that I get the error "QUERY MEMORY LIMIT REACHED" in AllegroGraph (please also see my related SO question). As the repository only contains around 200k triples, which easily fit into an (N-Triples) input file of ca. 60 MB, I wonder how executing this query can require more than 4 GB of RAM, roughly two orders of magnitude more than the input size.

cyroxx
  • Given that the code already crashes for reasonably sized input, this doesn't really qualify as "working code". I'm moving this to Stack Overflow where, I think, it's more appropriate. – sepp2k Dec 18 '12 at 16:39
  • possible duplicate of [Is it possible to aggregate over two resources in SPARQL?](http://stackoverflow.com/questions/12325974/is-it-possible-to-aggregate-over-two-resources-in-sparql) – Paul Sweatte Oct 21 '13 at 17:55

1 Answer


Try splitting the computation into subqueries, for example:

SELECT
    ?person
    (MAX(?sent_emails_) AS ?sent_emails)
    (MAX(?received_emails_) AS ?received_emails)
    (MAX(?receivedInCC_emails_) AS ?receivedInCC_emails)
    (MAX(?commits_) AS ?commits)
WHERE {
  {
   SELECT
          ?person
          (COUNT(DISTINCT ?sent_email) AS ?sent_emails_)
          (0 AS ?received_emails_)
          (0 AS ?receivedInCC_emails_)
          (0 AS ?commits_)
   WHERE {
    ?sent_email rdf:type email:Email.
    ?sent_email email:sender ?person.
    ?person rdf:type foaf:Person.
   } GROUP BY ?person
  } UNION {
     # similar subqueries for the received, CC and commit counts
     ....
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)

The objective is to:

  • avoid generating a huge number of intermediate rows that must all be materialized before they can be aggregated
  • avoid the OPTIONAL{} patterns, which also hurt performance
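To see why the original query blows up, here is a back-of-the-envelope calculation (the per-person match counts below are made up, purely for illustration): joining several OPTIONAL patterns on the same ?person produces one intermediate row per combination of bindings, i.e. the product of the per-pattern match counts, and all of those rows exist before GROUP BY and COUNT(DISTINCT) collapse them.

```python
# Hypothetical per-person match counts (illustrative assumptions only).
sent, received, cc, commits = 50, 200, 100, 30

# Four patterns joined on the same ?person yield one row per combination
# of bindings, so the intermediate result grows multiplicatively.
rows_for_one_person = sent * received * cc * commits
print(rows_for_one_person)  # 30000000 rows for a single person
```

With the UNION of subqueries, each branch produces only one row per person instead, so the intermediate result stays linear in the number of people.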
enridaga
  • There's nothing the matter with that. According to profile, the asker was here two hours ago, so he'll probably see the answer and get a notification soon. – Joshua Taylor Jul 08 '14 at 13:04
  • Back then, I think I finally ended up with a solution like this. – cyroxx Jul 08 '14 at 22:29