Using the DBpedia-Live SPARQL endpoint http://dbpedia-live.openlinksw.com/sparql, I am trying to count the total number of triples associated with instances of type owl:Thing. Because the count is very large, the endpoint throws the exception "Virtuoso 42000 Error The estimated execution time". To get around this, I tried using a subselect with limit and offset in the query. However, when the offset is greater than or equal to the limit, this doesn't work and the same exception (Virtuoso 42000 Error) is thrown again. Can anyone identify the problem with my query, or suggest a workaround? Here is the query I was trying:

select count(?s) as ?count
where
{
  ?s ?p ?o
  {
    select ?s
    where
    {
      ?s rdf:type owl:Thing.
    }
    limit 10000
    offset 10000
  }
}
TallTed
singha
  • Works for me on http://dbpedia.org/sparql. Note that it is a shared resource used by many people; you don't have any performance guarantees, nor a guarantee of uptime. The workaround is to load the DBpedia dump and process the data locally. In your case, you could even use UNIX commands like `grep`, etc. – UninformedUser Apr 28 '19 at 12:38
  • Thanks for replying. Unfortunately, the dumps would be a little older, and I am trying to perform an experiment with the current state of DBpedia and the live changes it produces. I am still unsure why the query would work on the static DBpedia endpoint and not on the live endpoint. I am assuming there would not be a drastic difference between the configurations of the two environments. Thanks again for the response. – singha Apr 28 '19 at 14:10
  • Well, the most obvious difference is Virtuoso 7 vs. Virtuoso 8. That alone can lead to different query execution plans etc. Moreover, different servers, different Virtuoso config, there can be so many things making the difference. – UninformedUser Apr 28 '19 at 14:20
  • Thanks for the information. So, my takeaway from this conversation would be that there is no way to get the count of the triples related to owl:Thing class from DBpedia LIVE SPARQL endpoint. – singha Apr 28 '19 at 19:42
  • That's something I cannot answer, but only DBpedia Live maintainers, Virtuoso devs or more experienced SPARQL users. – UninformedUser Apr 28 '19 at 19:52
  • The error messages you report appear to be incomplete. It can be difficult to impossible to provide useful advice without complete error text, so please provide complete messages in future. – TallTed Apr 29 '19 at 13:34

1 Answer

Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.

I had to guess at your original query, since you only posted the piecemeal one you were trying to use --

select ( count(?s) as ?count )
where
{
  ?s rdf:type owl:Thing.
}

I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
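
For anyone calling the endpoint over HTTP instead of through the form, here is a minimal Python sketch of how such a timeout could be sent along with the query. The parameter names `timeout` and `format` are assumptions based on what the Virtuoso SPARQL form submits, `build_request_url` is a hypothetical helper, and the server may still cap the value at its own configured limit.

```python
from urllib.parse import urlencode

# Endpoint taken from the question.
ENDPOINT = "http://dbpedia-live.openlinksw.com/sparql"

COUNT_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT (COUNT(?s) AS ?count)
WHERE { ?s rdf:type owl:Thing . }
"""


def build_request_url(query: str, timeout_ms: int = 3_000_000) -> str:
    """Build a GET URL for the endpoint (hypothetical helper).

    Assumption: the endpoint reads the client-requested timeout, in
    milliseconds, from a `timeout` parameter, as the SPARQL form does.
    """
    params = {
        "query": query,
        "format": "application/sparql-results+json",
        "timeout": timeout_ms,
    }
    return f"{ENDPOINT}?{urlencode(params)}"


url = build_request_url(COUNT_QUERY)
print(url)
```

The URL can then be fetched with any HTTP client. If a partial result comes back, repeating the same request later may return more solutions, as described above.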

I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.


Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.

Also, as you are trying to count triples, you should probably use count(*) instead of count(?s). If nothing else, this helps readers of the query understand what you're doing.
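
The two points above can be sketched together in Python: generate one query per page, with an ORDER BY inside the subselect so the pages don't overlap or skip subjects, and count(*) in the outer select. The helper names `page_query` and `page_queries` are hypothetical, and this is only a sketch of the paging logic. It does not guarantee that each page stays under the endpoint's estimated-cost limit, and ORDER BY over a large set has its own cost.

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
)


def page_query(limit: int, offset: int) -> str:
    """Return a query counting the triples for one page of subjects."""
    return PREFIXES + f"""SELECT (COUNT(*) AS ?count)
WHERE {{
  ?s ?p ?o .
  {{
    SELECT ?s
    WHERE {{ ?s rdf:type owl:Thing . }}
    ORDER BY ?s
    LIMIT {limit}
    OFFSET {offset}
  }}
}}"""


def page_queries(total_subjects: int, page_size: int) -> list[str]:
    """Return one query per page, covering offsets 0, page_size, 2*page_size, ..."""
    return [
        page_query(page_size, offset)
        for offset in range(0, total_subjects, page_size)
    ]


queries = page_queries(25_000, 10_000)
print(len(queries))
print(queries[-1])
```

Summing the `?count` values returned by the individual pages would give the overall total, with the caveat that the data may change between requests on a live endpoint.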


Toward being able to adjust the execution time limits your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.


Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.

TallTed
  • Well, the original query is `select ( count(?s) as ?count ) where { ?s rdf:type owl:Thing. ?s ?p ?o }`, i.e. the OP wants the number of triples in which resources of type `owl:Thing` are involved (or at least appear in subject position). And this query clearly times out: `Virtuoso 42000 Error The estimated execution time 354 (sec) exceeds the limit of 240 (sec).` The Virtuoso config sets a threshold of 240s, and the estimated time is above it. Without changing the config, I doubt there is a workaround here. – UninformedUser Apr 29 '19 at 13:56
  • Only the DBpedia Live admin could help here and change the `MaxQueryCostEstimationTime` param, but honestly I wouldn't do it because of the fair use policy of the DBpedia Live service – UninformedUser Apr 29 '19 at 14:00
  • @AKSW is right, I am interested in the number of all the triples (not just resources) related to owl:Thing. Even changing MaxQueryCostEstimationTime is not a solution, because other queries could be much more expensive; e.g., if I replace owl:Thing with foaf:Document, the query becomes much more expensive according to the query planner. The most annoying thing is that even limit 1 and offset 1 does not work in the above query (on DBpedia Live). – singha Apr 29 '19 at 15:40
  • @singha - I've added a bit to my answer. You'd definitely need your own instance to execute queries as expensive as you describe. Others may have ideas on how to make less expensive queries to achieve the same end-goals. – TallTed Apr 29 '19 at 15:54