1

I am doing a sparql query on the data from the Lahman baseball database. Here's some sample data to showcase what my query needs to do.

@prefix ma: <http://mydataset.com/ns/master#> .
ma:billybo01 ma:nameFirst "Billy" .
ma:billybo01 ma:nameLast "Bored" .
ma:chrisgow01 ma:nameFirst "Chris" .
ma:chrisgow01 ma:nameLast "Gowan" .
ma:bradlee01 ma:nameFirst "Brad" .
ma:bradlee01 ma:nameLast "Lee" .


@prefix teamQ2: <http://mydataset.com/ns/teamQ2#> .
@prefix yearQ2: <http://mydataset.com/ns/yearQ2#> .
@prefix ma: <http://mydataset.com/ns/master#> .
teamQ2:BS1 yearQ2:1871 ma:billybo01 .
teamQ2:BS1 yearQ2:1872 ma:billybo01 .
teamQ2:BS1 yearQ2:1873 ma:billybo01 .
teamQ2:LAN yearQ2:1874 ma:billybo01 .

teamQ2:LAN yearQ2:1871 ma:chrisgow01 .
teamQ2:LAN yearQ2:1872 ma:chrisgow01 .

teamQ2:BS1 yearQ2:1871 ma:bradlee01 .
teamQ2:BS1 yearQ2:1872 ma:bradlee01 .

I am trying to get all first and last names of players who only played for the team LAN and no other team. My attempt is what I show below. I was hoping to query for all players who were on the team LAN in any year, get their masterID, then subtract all players from that set that ever appeared on a team other than LAN. Then I just match up the masterId with the first and last name at the end. Right now it is returning the data as if the MINUS and FILTER !EXISTS cancel each other out and it just returns all players who played on LAN. Is there something else I need to use other than MINUS or the FILTER !EXISTS?

PREFIX ma: <http://mydataset.com/ns/master#>
PREFIX teamQ2: <http://mydataset.com/ns/teamQ2#> 
SELECT DISTINCT ?nameFirst ?nameLast
WHERE
{
  teamQ2:LAN ?yearID ?masterID .
  MINUS{FILTER (
      !EXISTS {
          teamQ2:LAN ?yearID ?nonLANmasterID .
      }
      )}
  ?masterID ma:nameLast ?nameLast .
  ?masterID ma:nameFirst ?nameFirst .
}
ORDER BY ?nameLast ?nameFirst
TallTed
  • 9,069
  • 2
  • 22
  • 37
Chase Lundell
  • 93
  • 1
  • 11
  • 1
    The closest to SQL "NOT IN" in SPARQL is "FILTER NOTEXISTS" . Put that inside a MINUS is double negation. There is a typo in the data: teamQ2:LANyearQ2:1872 ma:chrisgow01 . needs a space after LAN. – AndyS Jul 12 '18 at 23:02
  • Do you encode the time inside the URI, e.g. is there also Q1:LAN, Q3:LAN, etc. ? this is really bad. you should also look for a tutorial how to model n-ary relations and in particular events in RDF. blank nodes is the usual way to go – UninformedUser Jul 13 '18 at 01:50
  • No I'm not encoding time like you said I just put Q2 to differentiate the data for other queries I was doing. I'm doing a project where I port the same data set to different DBMSs to highlight their differences. I made a Python script that ports data from a csv file which was created from a sql table. So that's why the data seems clunky. I wasn't sure what a better way was to incorporate yearID into the data. Do you know of a link or tutorial that would explain n-ary relations and blank nodes? That would really help me understand this better I think. – Chase Lundell Jul 13 '18 at 18:02
  • 1
    @ChaseLundell have a look here: https://stackoverflow.com/questions/32923213/how-can-i-express-additional-information-time-probability-about-a-relation-in – Jeen Broekstra Jul 13 '18 at 20:06
  • @Jeen BroekstraThis is awesome thanks so much! – Chase Lundell Jul 13 '18 at 21:24

2 Answers2

1

The modelling is a rather unusual but here is the outline of a query based on your description:

  1. Find all the teamQ2:LAN ?p ?o . triples.

  2. Reject (FILTER NOT EXISTS) any match where there is a different (FILTER(?otherteam != teamQ2:LAN ) subject the property-object pair ?p ?o.

    (This is actually "did not play for the same team in the same year, because ?p is yearQ2:1871 etc includes the year.)

PREFIX ma: 
PREFIX teamQ2:  
SELECT *
WHERE
{
  teamQ2:LAN ?p ?masterID .
  FILTER NOT EXISTS {
      ?otherteam ?p ?masterID .
      FILTER(?otherteam !=  teamQ2:LAN )
  }
  ?masterID ma:nameLast ?nameLast .
  ?masterID ma:nameFirst ?nameFirst .
}

In this case it can also be done as:

  1. Find players of teamQ2:LAN.
  2. Find players not of teamQ2:LAN.
  3. MINUS on ?masterID the two above.
PREFIX ma: 
PREFIX teamQ2:  
SELECT *
WHERE
{
  teamQ2:LAN ?yearID ?masterID .
  MINUS {
      ?otherteam ?yearID ?masterID .
      FILTER(?otherteam !=  teamQ2:LAN )
  }
  ?masterID ma:nameLast ?nameLast .
  ?masterID ma:nameFirst ?nameFirst .
}

I think you should consider recording the data as events in time:

player "play for team" in "year" or in triples:

ma:bradlee01 :playedForInYear [ :team teamQ2:BS1 ; :year 1871 ].
ma:bradlee01 :playedForInYear [ :team teamQ2:BS1 ; :year 1872 ].
AndyS
  • 16,345
  • 17
  • 21
  • if I'm not wrong, `?o` should be `?masterId` in your first query? – UninformedUser Jul 13 '18 at 01:43
  • Thanks - fixed. I trying make the point it is about graph shape. The incorporation of year-ness into the property is a bit weird IMO. – AndyS Jul 13 '18 at 11:30
  • This is great, I will probably need to restructure the data like you said at the end because I need the query to be able to not say "not played for the same team in the same year". I really need the query to say "not played on a different team other than LAN in any year" or "only played on LAN for every year played" If I restructured the data like you mentioned at the end can I still accomplish that? how do you query the array of objects in the triple? – Chase Lundell Jul 13 '18 at 18:07
1

In your example, I assume the answer you want back is "Chris Gowan", since he is the only player there that played for LAN, and not for any other team, correct?

The way to approach it is to build it up, step by step. First, a query to just get everyone who played for LAN:

SELECT DISTINCT ?firstName ?lastName
WHERE { 
         teamQ2:LAN ?yearId ?player .
         ?player ma:nameLast ?lastName;
                 ma:nameFirst ?firstName .
}

Result will be:

Evaluating SPARQL query...
+-------------------------------------+-------------------------------------+
| firstName                           | lastName                            |
+-------------------------------------+-------------------------------------+
| "Billy"                             | "Bored"                             |
| "Chris"                             | "Gowan"                             |
+-------------------------------------+-------------------------------------+
2 result(s) (75 ms)

Now, we want to add a filter that removes players that also played for another team in another year. A player played for another team if they are the object of a triple where the subject is a team other than LAN.

I'll use MINUS because that's what you asked. There's also FILTER NOT EXISTS which I often find more intuitive to work with, but they're roughly the same thing. So the MINUS condition we need should look something like this:

 MINUS { ?otherTeam ?otherYearId ?player }

But this isn't enough because we need to put restrictions on the allowed values for ?otherTeam and ?otherYearId (otherwise this condition would just match everything).

To be frank, this is slightly tricky to do correctly, because the data model is so poor. It would be a lot easier if, for example, all team resources were of rdf:type teamQ2:Team, not to mention the fact that using the year as the property name for the relation between a team and a player is not a great way to model this. But I digress.

We'll use a poor version of identifying teams: we'll assume that every resource that starts with the teamQ2 prefix is a team identifier. So we want to MINUS all the teams for which the player played where the subject starts with teamQ2 but does not end with LAN:

 MINUS { ?otherTeam ?otherYearId ?player . 
         FILTER(STRSTARTS(STR(?otherTeam), STR(teamQ2:)) 
         FILTER(!STRENDS(STR(?otherTeam), "LAN"))
 }

which makes this the full query:

SELECT DISTINCT ?firstName ?lastName
WHERE { 
         teamQ2:LAN ?yearId ?player .
         ?player ma:nameLast ?lastName;
                 ma:nameFirst ?firstName .
         MINUS { ?otherTeam ?otherYearId ?player . 
                  FILTER(STRSTARTS(STR(?otherTeam), STR(teamQ2:))) 
                  FILTER(!STRENDS(STR(?otherTeam), "LAN"))
         }
}

Result:

Evaluating SPARQL query...
+-------------------------------------+-------------------------------------+
| firstName                           | lastName                            |
+-------------------------------------+-------------------------------------+
| "Chris"                             | "Gowan"                             |
+-------------------------------------+-------------------------------------+
1 result(s) (2 ms)
Jeen Broekstra
  • 21,642
  • 4
  • 51
  • 73