I want to fetch paginated data with a SPARQL query for one type of record that has some multi-valued attributes, such as type and image.

The query below returns duplicates, and hence the pagination goes wrong.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema:<http://schema.org/>
SELECT distinct ?uri ?label ?r ?type ?image ?ownership ?rating ?comments ?allOwners
FROM <http://sample.net/>
WHERE  {
  ?r rdf:type <http://schema.org/Relation> . 
  ?r schema:property ?uri.
  ?r schema:owner ?owner .
  ?r schema:ownership ?ownership .
  ?uri rdfs:label ?label .
  ?uri rdf:type ?type . 
  ?uri schema:image ?image .
  OPTIONAL {?r schema:comments ?comments .}
  OPTIONAL {?r schema:rating ?rating .}
  filter (?owner =<http://sample.net/resource/37654824-334f-4e57-a40c-4078cac9c579>)
} limit 20 offset 0

Sample data:

subject,predicate,object
Product-uri,type,Vehicle
Product-uri,type,Car
Product-uri,type,Toyota
Product-uri,image,Image-key1.png
Product-uri,image,Image-key2.png
Product-uri,image,Image-key3.png
Product-uri2,type,Vehicle
Product-uri2,type,Car
Product-uri2,type,Toyota
Product-uri2,image,Image-key21.png
Product-uri2,image,Image-key22.png
Product-uri2,image,Image-key23.png

If I query this data to fetch a list of unique products (where each product has multiple types and images), the total count will be 12 instead of 2.

  • What do you mean by duplicates? On different pages? If so, this is because you have to provide an ordering over the whole result: there is no implicit ordering apart from implementation-specific behaviour, and that is never guaranteed. Long story short, use `ORDER BY` with one or more variables. – UninformedUser May 30 '21 at 07:46
  • And before you ask or wonder: ordering makes the whole query slow. – UninformedUser May 30 '21 at 07:47

1 Answer

As noted in the comments, the first important thing is to include an ORDER BY in your query whenever LIMIT and OFFSET are being used to step through a large solution set.
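A minimal sketch of what that could look like for the query in the question. The choice of sort keys is an assumption: the aim is to list enough variables that every row has a fixed position, so that successive pages neither overlap nor skip rows. The unbound `?allOwners` variable is left out here for brevity.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?uri ?label ?r ?type ?image ?ownership ?rating ?comments
FROM <http://sample.net/>
WHERE {
  ?r rdf:type schema:Relation .
  ?r schema:property ?uri .
  ?r schema:owner ?owner .
  ?r schema:ownership ?ownership .
  ?uri rdfs:label ?label .
  ?uri rdf:type ?type .
  ?uri schema:image ?image .
  OPTIONAL { ?r schema:comments ?comments . }
  OPTIONAL { ?r schema:rating ?rating . }
  FILTER (?owner = <http://sample.net/resource/37654824-334f-4e57-a40c-4078cac9c579>)
}
# Sort keys are an assumption; add more variables if rows can still tie.
ORDER BY ?uri ?r ?type ?image
LIMIT 20
OFFSET 0
```

For the next page, keep the same `ORDER BY` and change only the `OFFSET` (20, 40, and so on).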

(ORDER BY cannot be applied until the entire solution set is found, so it may appear to slow the query, as was also commented. In actuality, the query runs at the same speed. Without an ORDER BY, solutions may be returned as they're found, so some of them may arrive quite quickly, but producing the full solution set takes very close to the same time with or without the ORDER BY.)

The DISTINCT applies to an entire solution row -- so if any column varies, you'll get rows that otherwise appear to be duplicates.
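If the rows that look like duplicates differ only in `?type` or `?image`, one possible approach is to group by the record and collapse the multi-valued columns with aggregates, so that `LIMIT` and `OFFSET` count records rather than value combinations. This is a sketch under that assumption, using standard SPARQL 1.1 aggregates. The result names `?types`, `?images`, `?anyLabel`, `?anyRating` and `?anyComments` are illustrative, and choosing `SAMPLE` for the single-valued-looking columns is a guess about the data.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

SELECT ?uri ?r ?ownership
       (SAMPLE(?label) AS ?anyLabel)
       # Collapse every type and image of a record into one cell each.
       (GROUP_CONCAT(DISTINCT STR(?type); SEPARATOR=", ") AS ?types)
       (GROUP_CONCAT(DISTINCT STR(?image); SEPARATOR=", ") AS ?images)
       (SAMPLE(?rating) AS ?anyRating)
       (SAMPLE(?comments) AS ?anyComments)
FROM <http://sample.net/>
WHERE {
  ?r rdf:type schema:Relation .
  ?r schema:property ?uri .
  ?r schema:owner <http://sample.net/resource/37654824-334f-4e57-a40c-4078cac9c579> .
  ?r schema:ownership ?ownership .
  ?uri rdfs:label ?label .
  ?uri rdf:type ?type .
  ?uri schema:image ?image .
  OPTIONAL { ?r schema:comments ?comments . }
  OPTIONAL { ?r schema:rating ?rating . }
}
# One row per (?uri, ?r, ?ownership) combination.
GROUP BY ?uri ?r ?ownership
ORDER BY ?uri ?r
LIMIT 20
OFFSET 0
```

If `?ownership` can itself take several values per relation, it would need the same aggregate treatment rather than a place in the `GROUP BY`.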

Your question does not make clear what you're seeing as "duplicates". Perhaps you could add some sample results and/or some sample data, so we have a better idea of what's not doing what you want it to.

TallTed