The same query on Sparql gives different results

Question

I read some questions related to mine, like "Same sparql not returning same results", but I think my case is a little different.

Consider this query, which I submit to http://live.dbpedia.org/sparql (a Virtuoso endpoint) and which returns 34 triples:

SELECT  ?pred ?obj
    WHERE { 
           <http://dbpedia.org/resource/Johann_Sebastian_Bach> ?pred ?obj
        FILTER((langMatches(lang(?obj), "")) ||
                      (langMatches(lang(?obj), "EN"))
          )
    }

Then I used the same query in Python code:

import rdflib
import rdfextras
rdfextras.registerplugins()

g=rdflib.Graph()
g.parse("http://dbpedia.org/resource/Johann_Sebastian_Bach")

PREFIX = """
                PREFIX dbp: <http://dbpedia.org/resource/>
"""

query = """
                SELECT ?pred ?obj
                    WHERE {dbp:Johann_Sebastian_Bach ?pred ?obj
                        FILTER( (langMatches(lang(?obj), "")) ||
                                (langMatches(lang(?obj), "EN")))}
"""
query = PREFIX + query
result_set = g.query(query)
print len(result_set)

This time, I get only 27 triples! https://dl.dropboxusercontent.com/u/22943656/result.txt

I thought it could be related to the DBpedia site, but I repeated these queries several times and always got the same difference. So I downloaded the RDF file to test it locally, and used Protégé to stand in for the Virtuoso endpoint. Even then, I still get different results from the same SPARQL query in Protégé and in Python: 31 and 27. Is there any explanation for this difference? And how can I get the same result in both?

I think they don't need to be declared in the Virtuoso endpoint, because it queries DBpedia. But even if I remove the default prefixes (PREFIX rdf: , PREFIX owl: , PREFIX xsd: , PREFIX rdfs: ), it doesn't change the result in Python. — Marcelo, Nov 26 '13 at 15:44
What are the 31 results that you get? What are the 27 results that you get? — Joshua Taylor, Nov 26 '13 at 17:00
Do you have any reason to suppose that the data from http://live.dbpedia.org/sparql (i.e., from DBpedia Live) is going to be the same as the data in the main DBpedia (what http://dbpedia.org/resource/Johann_Sebastian_Bach gives you)? You're querying different datasets. — Joshua Taylor, Nov 26 '13 at 17:01
`lang` returns `""` for literals that don't have a language tag. I'm not sure how `langMatches` handles `""`, but what happens if you change `langMatches(lang(?obj),"")` to `lang(?obj) = ""`? — Joshua Taylor, Nov 26 '13 at 23:21
There was a bug in earlier versions of RDFlib where `lang(x)` for a literal `x` that didn't have a language tag would return `None` instead of `""`. It's mentioned in [this issue](http://code.google.com/p/rdfextras/issues/detail?id=15). — Joshua Taylor, Nov 26 '13 at 23:24
In trying to figure out what should happen here, I've asked a question on http://answers.semanticweb.com, [Is langMatches("","") true or false?](http://answers.semanticweb.com/questions/25434/is-langmatches-true-or-false) — Joshua Taylor, Nov 26 '13 at 23:52
I know we've got a working solution now, but what version of rdflib are you using? The later versions of rdflib incorporate SPARQL querying into the code, so you don't need to import rdfextras, but I'm running into some trouble trying to get your code to run under the later versions of rdflib. — Joshua Taylor, Nov 27 '13 at 16:47
I'm using the version 4.0.1 and the python is 2.7.2. Yes, you are right! I've tested without rdfextras and it works. — Marcelo, Nov 27 '13 at 21:25

Joshua Taylor · Accepted Answer · 2013-11-27T12:08:59.827

As the question is written, there are a few possible problems. Based on the comments, the first one described here (about lang, langMatches, etc.) seems to be what you're actually running into, but I'll leave the descriptions of the other possible problems, in case someone else finds them useful.

`lang`, `langMatches`, and the empty string

lang is defined to return "" for literals with no language tag. According to RFC 4647 §2.1, basic language ranges, which share their syntax with language tags, are defined as follows:

2.1. Basic Language Range

A "basic language range" has the same syntax as an [RFC3066] language tag or is the single character "*". The basic language range was originally described by HTTP/1.1 [RFC2616] and later [RFC3066]. It is defined by the following ABNF [RFC4234]:
language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
alphanum         = ALPHA / DIGIT

This means that "" isn't actually a legal language tag. As Jeen Broekstra pointed out on answers.semanticweb.com, the SPARQL recommendation says:

17.2 Filter Evaluation

SPARQL provides a subset of the functions and operators defined by XQuery Operator Mapping. XQuery 1.0 section 2.2.3 Expression Processing describes the invocation of XPath functions. The following rules accommodate the differences in the data and execution models between XQuery and SPARQL: …

Functions invoked with an argument of the wrong type will produce a type error. Effective boolean value arguments (labeled "xsd:boolean (EBV)" in the operator mapping table below), are coerced to xsd:boolean using the EBV rules in section 17.2.2.

Since "" isn't a legal language tag, it might be considered "an argument of the wrong type [that] will produce a type error." In that case, the langMatches invocation would produce an error, and that error will be treated as false in the filter expression. Even if it doesn't return false for this reason, RFC 4647 §3.3.1, which describes how language tags and ranges are compared, doesn't say exactly what should happen in the comparison, since it's assuming legal language tags:

Basic filtering compares basic language ranges to language tags. Each basic language range in the language priority list is considered in turn, according to priority. A language range matches a particular language tag if, in a case-insensitive comparison, it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first character following the prefix is "-". For example, the language-range "de-de" (German as used in Germany) matches the language tag "de-DE-1996" (German as used in Germany, orthography of 1996), but not the language tags "de-Deva" (German as written in the Devanagari script) or "de-Latn-DE" (German, Latin script, as used in Germany).
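The matching rule quoted above can be sketched in a few lines of Python. This is only a literal reading of the RFC text, not any engine's actual implementation; the function name `basic_filter_matches` is made up for illustration, and the non-empty-tag condition for `*` comes from SPARQL's definition of langMatches rather than from the RFC.

```python
def basic_filter_matches(language_range: str, language_tag: str) -> bool:
    """Literal reading of RFC 4647 section 3.3.1 (basic filtering).

    Hypothetical helper for illustration; it does not validate its inputs
    against the ABNF, which is exactly where implementations can diverge.
    """
    rng = language_range.lower()
    tag = language_tag.lower()
    if rng == "*":
        # SPARQL's langMatches treats "*" as "any non-empty language tag".
        return tag != ""
    # Case-insensitive exact match, or a prefix of the tag followed by "-".
    return tag == rng or tag.startswith(rng + "-")


# The examples from the RFC text:
print(basic_filter_matches("de-de", "de-DE-1996"))
print(basic_filter_matches("de-de", "de-Deva"))
print(basic_filter_matches("de-de", "de-Latn-DE"))

# The case in question: with no input validation, "" "exactly equals" "".
print(basic_filter_matches("", ""))
```

Read this way, the empty range matches the empty tag by plain string equality. An implementation that first checks its arguments against the ABNF would instead reject "" before ever comparing.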

Based on your comments and my local experiments, it appears that langMatches(lang(?obj),"") for literals without language tags (so really, langMatches("","")) is returning true in Virtuoso (as it's installed on DBpedia), Jena's ARQ (from my experiments), and Protégé (from our experiments), and it's returning false (or an error that's coerced to false) in RDFlib.
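To see how those behaviours change the result count, here is a small sketch of how SPARQL combines the two halves of the filter. It follows the truth table for `||` in section 17.2 of the SPARQL recommendation, where an error on one side only survives if the other side is not true, and a filter keeps a solution only when the whole expression is true. The names `ERROR`, `sparql_or` and `filter_keeps` are made up for this illustration.

```python
ERROR = object()  # hypothetical sentinel standing in for a SPARQL expression error


def sparql_or(left, right):
    """SPARQL's || over operands that are True, False, or ERROR."""
    if left is True or right is True:
        return True   # true || anything is true, even true || error
    if left is ERROR or right is ERROR:
        return ERROR  # false || error and error || error stay errors
    return False


def filter_keeps(value) -> bool:
    """A FILTER keeps a solution only if the expression is true."""
    return value is True


# For a literal without a language tag, langMatches("", "EN") is false, so
# everything depends on what the engine does with langMatches("", ""):
for name, first_operand in [("true", True), ("false", False), ("error", ERROR)]:
    kept = filter_keeps(sparql_or(first_operand, False))
    print(f'langMatches("", "") is {name}: literal kept = {kept}')

# lang(?obj) = "" is true for such a literal, so it is kept in every engine.
print(filter_keeps(sparql_or(True, False)))
```

Only the first case keeps the untagged literals, which would explain why the numeric and date values show up in some engines and not in others.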

In either case, since lang is defined to return "" for literals without a language tag, you should be able to reliably include them in your results by replacing langMatches(lang(?obj),"") with lang(?obj) = "".

Issues with the data that you're using

You're not querying the same data. The data that you download from

http://dbpedia.org/resource/Johann_Sebastian_Bach

is from DBpedia, but when you run a query against

http://live.dbpedia.org/sparql,

you're running it against DBpedia Live, which may have different data. If you run this query on the DBpedia Live endpoint and on the DBpedia endpoint, you get a different number of results:

SELECT count(*) WHERE { 
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  FILTER( langMatches(lang(?obj), "")  || langMatches(lang(?obj), "EN" ) )
}

DBpedia Live results 31
DBpedia results 34

Issues with `distinct`

Another possible problem, though it doesn't seem to be the one that you're running into, is that your second query has a distinct modifier, but your first one doesn't. That means that your second query could easily have fewer results than the first one.

If you run this query against the DBpedia SPARQL endpoint you should get 34 results, and that's the same whether or not you use the distinct modifiers, and it's the number that you should get if you download the data and run the same query against it.

select ?pred ?obj where { 
  dbpedia:Johann_Sebastian_Bach ?pred ?obj
  filter( langMatches(lang(?obj), "") || langMatches(lang(?obj), "EN") )
}


Thanks Joshua. I removed the DISTINCT modifier, and, as you mentioned, it didn't change the result. My main question is: why does the same query, run in Python and in SPARQL (Protégé) against my local RDF data, still give me different results? — Marcelo, Nov 26 '13 at 22:53
I've realised that these differences are related to these triples: http://dbpedia.org/ontology/birthDate 1685-03-21 http://dbpedia.org/ontology/deathDate 1750-07-28 http://dbpedia.org/ontology/wikiPageID 9906294 http://dbpedia.org/ontology/wikiPageRevisionID 546793436 http://dbpedia.org/property/dateOfBirth 21 http://dbpedia.org/property/dateOfDeath 28 http://dbpedia.org/property/id 24 — Marcelo, Nov 26 '13 at 22:54
@user2725174 Are you still running the first query against DBpedia Live? DBpedia Live is _not the same data_ as the DBpedia release, which is what you get from `http://dbpedia.org/resource/*`. It's not surprising that you get different results if you query different data. As I asked in a comment on the question, please update your question with the _actual_ different results that you're getting. — Joshua Taylor, Nov 26 '13 at 22:55
all have numbers in the objects! To be clear, I get fewer triples when I run the query in Python. Could you explain that? Is there a way to get the same result in Python? — Marcelo, Nov 26 '13 at 23:02
the last comments were based on the rdf data downloaded from DBpedia. When I run the query in the sparql (protege) and python they are different — Marcelo, Nov 26 '13 at 23:06
Of course they have numbers in the objects, that's to be expected because of your query. That's not a surprise, right? The lang of such a literal will match `""`. — Joshua Taylor, Nov 26 '13 at 23:06
@user2725174 Your question doesn't include anything about Protégé, only about Python and a webclient. We can't diagnose where you're getting differences, if we can't see _what_ differences you're getting. If you don't edit your question to show what results you're getting under one system and what results you're getting under another, we're _not_ going to be able to provide any help. — Joshua Taylor, Nov 26 '13 at 23:08
Yes, but these numbers are only in the query using protege. When I use python, they are not retrieved. These are the differences! — Marcelo, Nov 26 '13 at 23:10
@user2725174 That's the first time that you've made it clear _what_ the differences are. You're not getting datatype literals that don't have a language tag when you run the query in Python, but with other query engines you are. It would have been _much_ easier if you could have _shown_ those results in the question; then someone else might have been able to _understand_ what was going on before this. Now it boils down to this: "This query … selects non-string literals as `?obj` when I use Protégé or DBpedia's Virtuoso, but these aren't selected when using Python's rdflib. Why?" — Joshua Taylor, Nov 26 '13 at 23:17
@user2725174 I don't have an answer for that yet, but _that's_ a much more concrete and specific question. — Joshua Taylor, Nov 26 '13 at 23:17
@user2725174 In a comment on the question, I added a possible workaround. Please take a look at that and try it out, and let me know what happens. — Joshua Taylor, Nov 26 '13 at 23:21
Thanks again Joshua! Just to let you know, the problem was with the langMatches(lang(?obj),""). When I changed it to lang(?obj) = "", it worked! — Marcelo, Nov 27 '13 at 00:03
