What is a 'dataset' in the context of a SPARQL query?

Question

The SPARQL specification mentions that the FROM clause can be used to specify a dataset.

A SPARQL query may specify the dataset to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset.

What is a "dataset" in the context of SPARQL? I'm very familiar with databases in general, and I understand in principle that a query for data phrased in a language such as SQL is then executed against a dataset to resolve some subset of that dataset.

I'm trying to understand the following query:

prefix cpmeta: <...some_domain>

select distinct
?uri
?label
?stationId

from <...some_domain>
from <...some_domain>
from <...some_domain>
from <...some_domain>
from named <...some_domain>

where {

    { ?uri rdfs:label ?label }

    UNION

    { ?uri cpmeta:hasName ?label }

    UNION 

    {
        graph <...some_domain> {
            ?uri a cpmeta:Station .
            ?uri cpmeta:hasName ?label .
        }
    }

    ?uri cpmeta:hasStationId ?stationId
}

limit 100

So from the specification I understand, in principle, that:

There are 4 datasets specified, and (I think)
One 'RDF dataset' is defined

However, the query still executes (with slightly different results) if I leave out the FROM and FROM NAMED clauses:

prefix cpmeta: <...some_domain>

select distinct
?uri
?label
?stationId

where {

    { ?uri rdfs:label ?label }

    UNION

    { ?uri cpmeta:hasName ?label }

    UNION 

    {
        graph <...some_domain> {
            ?uri a cpmeta:Station .
            ?uri cpmeta:hasName ?label .
        }
    }

    ?uri cpmeta:hasStationId ?stationId
}

limit 100

So apparently there is already a dataset specified. Is that via the prefix?

Questions:

Why is an RDF dataset identified differently to a regular dataset (FROM vs FROM NAMED)?
The URI for the prefix is actually reused in a FROM statement. What is the difference between a prefix and a FROM clause?

This question - Specifying dataset within a SPARQL query - shows how to specify a dataset, but doesn't explain what that means in the context of a SPARQL query and in the context of however that SPARQL query is resolved to actual data.

This question - FROM clause in SPARQL queries - mentions that a SPARQL query without a FROM clause is executed against a default dataset. But then why would omitting all datasets still result in data returned by the query?

Ah. I see that specifying multiple `FROM` clauses is actually defined in the documentation: https://www.w3.org/TR/sparql11-query/#unnamedGraph. `If a query provides more than one FROM clause, providing more than one IRI to indicate the default graph, then the default graph is the RDF merge of the graphs obtained from representations of the resources identified by the given IRIs.` — Zach Smith, Feb 21 '20 at 13:07
And this seems a relevant discussion to link to WRT to how graphs are merged. https://www.w3.org/2011/rdf-wg/track/issues/17. seems complicated — Zach Smith, Feb 21 '20 at 13:14

bastbijl · Accepted Answer · 2020-02-21T15:46:47.790

Comparing the execution of a SPARQL query with that of a SQL query is a bit tricky. SPARQL operates at a higher level.

Datasets

An endpoint (e.g. a database like Virtuoso or GraphDB) has some freedom in whether and how it implements SPARQL concepts.

The dataset is one such concept. Usually a graph database allows you to create a repository, which is equivalent to a database in the SQL world. Triples are stored inside it, and these triples can be grouped into named graphs. The GRAPH construct helps you to select which set of triples to look in.

The repository is the dataset you are referring to.

Very few databases support querying datasets/repositories that are not hosted in that same database, for fairly obvious reasons.

SPARQL

The less precise your query, the more data it is matched against. Using GRAPH <...> {} narrows down the set of triples that a pattern is matched against, without the need to write a full subquery.

Don't confuse datasets with namespaces. IDs in the world of RDF are always URIs. The first part of a URI usually names the organisation that coined the ID, but the whole URI is still just an ID. Prefixes only make the IDs shorter to write; they do not select any data.
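
To make the point about prefixes concrete, here is a minimal Python sketch (a toy, not how any SPARQL engine is implemented) that expands a prefixed name into a full URI. The `http://example.org/cpmeta/` namespace and the `expand` helper are hypothetical, made up for this illustration; the real `cpmeta` namespace is elided in the question.

```python
# Hypothetical prefix table, playing the role of PREFIX declarations.
PREFIXES = {
    "cpmeta": "http://example.org/cpmeta/",  # assumed namespace, not the real one
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(prefixed_name, prefixes=PREFIXES):
    """Turn 'prefix:local' into a full URI by plain string concatenation."""
    prefix, _, local = prefixed_name.partition(":")
    return prefixes[prefix] + local

# The short and the long spelling name the same ID.
# Nothing here selects a graph or a dataset.
full_uri = expand("cpmeta:hasName")
```

The expansion is pure text substitution, which is why the same URI can show up both in a prefix declaration and in a FROM clause while meaning two unrelated things.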

You could put each triple in a separate graph, which turns the name of the graph into an identifier of the triple. This is not intended, but also not forbidden usage.

score 1 · Answer 2 · answered Feb 21 '20 at 14:56

An RDF Dataset is a collection of graphs. It has one default, unnamed graph and zero or more named graphs.
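
As a rough mental model only, the structure can be sketched in Python with plain sets and dicts. All graph names and triples below are invented for illustration, and the `match` helper is hypothetical; a real store would match full graph patterns, not a single predicate.

```python
# A triple is a (subject, predicate, object) tuple; a graph is a set of triples.
# All names below are made up for illustration.
dataset = {
    "default": {
        ("ex:s1", "rdfs:label", "Station one"),
    },
    "named": {
        "ex:graphA": {("ex:s1", "rdf:type", "ex:Station")},
        "ex:graphB": {("ex:s2", "rdf:type", "ex:Station")},
    },
}

def match(graph, predicate):
    """Return sorted (subject, object) pairs whose predicate matches."""
    return sorted((s, o) for (s, p, o) in graph if p == predicate)

# A pattern outside GRAPH { ... } is matched against the default graph.
outside_graph = match(dataset["default"], "rdf:type")

# A pattern inside GRAPH <ex:graphA> { ... } is matched against that named graph.
inside_graph = match(dataset["named"]["ex:graphA"], "rdf:type")
```

In this toy dataset the `rdf:type` triples live only in named graphs, so the pattern finds nothing in the default graph and one match inside `ex:graphA`.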

A SPARQL endpoint has a dataset to query. If you don't do anything else, the query executes against whatever the endpoint you send the query to has as its RDF Dataset.

That's why the OP's query returned results. The endpoint already had an RDF dataset to query.

Some (a minority, not all) endpoints allow the query to change the RDF dataset it runs against by using FROM and FROM NAMED. These two clauses describe the RDF dataset required. The URIs may refer to graphs on the web or graphs in the default dataset depending on implementation (graphs in the default dataset is more common in my experience).
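
A hedged Python sketch of what "describing the dataset" amounts to, for an endpoint that resolves the URIs to graphs it already holds. The `STORE` contents and the `build_query_dataset` helper are hypothetical, and the RDF merge is simplified to a set union, which ignores blank-node renaming.

```python
# Hypothetical store: graph URI -> set of (subject, predicate, object) triples.
STORE = {
    "ex:g1": {("ex:s1", "ex:hasName", "One")},
    "ex:g2": {("ex:s2", "ex:hasName", "Two")},
    "ex:g3": {("ex:s3", "ex:hasName", "Three")},
}

def build_query_dataset(from_uris, from_named_uris, store=STORE):
    """Build the dataset that FROM / FROM NAMED clauses describe.

    Simplification: the RDF merge is modelled as a plain set union.
    """
    default_graph = set()
    for uri in from_uris:
        default_graph |= store[uri]  # every FROM graph is merged into the default graph
    named_graphs = {uri: store[uri] for uri in from_named_uris}
    return {"default": default_graph, "named": named_graphs}

# Rough equivalent of: FROM <ex:g1> FROM <ex:g2> FROM NAMED <ex:g3>
ds = build_query_dataset(["ex:g1", "ex:g2"], ["ex:g3"])
```

Under this model, several FROM clauses do not give several datasets: they all feed one default graph, while each FROM NAMED graph stays separate and is only reachable through GRAPH.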

The SPARQL protocol for query also has optional default-graph-uri and named-graph-uri parameters that function like FROM and FROM NAMED. Again, not all endpoints respect the parameters.
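
For illustration, this is roughly how those parameters look in a GET request URL, built with Python's standard library. The endpoint URL and the graph URIs are placeholders I made up; whether a given endpoint honours the parameters is implementation-dependent, as noted above.

```python
from urllib.parse import urlencode, parse_qs

ENDPOINT = "https://example.org/sparql"  # hypothetical endpoint URL

query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"

# A list of pairs lets a parameter repeat, like several FROM clauses.
params = [
    ("query", query),
    ("default-graph-uri", "http://example.org/g1"),  # acts like FROM
    ("default-graph-uri", "http://example.org/g2"),  # acts like FROM
    ("named-graph-uri", "http://example.org/g3"),    # acts like FROM NAMED
]

request_url = ENDPOINT + "?" + urlencode(params)

# Decode the query string again to check what the endpoint would receive.
decoded = parse_qs(request_url.split("?", 1)[1])
```

The query text itself contains no FROM clause here; the dataset description travels alongside it as request parameters.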

The correct way to access a named graph during query execution is with GRAPH, not FROM.
