3

I am using the query below to obtain the Wikidata label for a given term.

SELECT ?item WHERE {
  ?item rdfs:label "Word2vec"@en
}

The output is wd:Q22673982

However, when I spell Word2vec as word2vec (i.e. all lowercase letters), the above query returns "No results".

Therefore, I would like to know if there is a way to find out how the term is spelled in Wikidata and get its label.

That is, if I enter a term in all lowercase, how can I identify the equivalent Wikidata term and return its corresponding label?

logi-kal
EmJ

2 Answers

7

The comments by AKSW are a better solution than the accepted answer, but since AKSW is not in the habit of posting proper answers, I'll do it for him...

We don't know your use-case, but if you're just trying to do a simple search over Wikidata entities, other services, such as the MediaWiki API entity search, might be more efficient. You can even use it inside SPARQL, e.g.:

SELECT * {
    SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:api "EntitySearch".
        bd:serviceParam wikibase:endpoint "www.wikidata.org".
        bd:serviceParam mwapi:search "word2vec".
        bd:serviceParam mwapi:language "en".
        ?item wikibase:apiOutputItem mwapi:item.
        ?num wikibase:apiOrdinal true.
    }
    ?item (wdt:P279|wdt:P31) ?type
}
ORDER BY ?num
LIMIT 20


What's going on in this query?

  1. The SERVICE call to wikibase:mwapi is not standard SPARQL, but a SPARQL extension that calls the MediaWiki API, in particular its entity search. More about that in the manual. What matters is the search term given as the value of mwapi:search, and the two lines that bind the found item to the variable ?item and its rank in the search results to ?num.
  2. The line ?item (wdt:P279|wdt:P31) ?type binds the type of each item to the variable ?type. It takes into account both the “subclass of” and “instance of” properties.
  3. ORDER BY ?num makes sure that the results are ordered by the rank, that is, the best match comes first, the second best match second, etc.
  4. LIMIT 20 keeps only the first 20 results in case there are more than 20.
  5. SELECT * means return all variables that were bound in the query, so in this case it will be ?item, ?type and ?num.
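Outside the query editor, a query like the one above could be sent to the SPARQL endpoint over HTTP. The following is a minimal sketch that only builds the request URL and does not contact any server. The endpoint address https://query.wikidata.org/sparql and the format=json parameter are assumptions about the public Wikidata Query Service, and build_request_url is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlencode

# Assumption: address of the public Wikidata SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The entity-search query from this answer, unchanged.
ENTITY_SEARCH_QUERY = """SELECT * {
    SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:api "EntitySearch".
        bd:serviceParam wikibase:endpoint "www.wikidata.org".
        bd:serviceParam mwapi:search "word2vec".
        bd:serviceParam mwapi:language "en".
        ?item wikibase:apiOutputItem mwapi:item.
        ?num wikibase:apiOrdinal true.
    }
    ?item (wdt:P279|wdt:P31) ?type
}
ORDER BY ?num
LIMIT 20"""


def build_request_url(query, endpoint=WDQS_ENDPOINT):
    """Return a GET URL carrying the query and asking for JSON results (hypothetical helper)."""
    # urlencode percent-encodes the braces, quotes, and newlines in the query.
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(ENTITY_SEARCH_QUERY)
```

The resulting URL can then be fetched with any HTTP client; the response handling is left out here on purpose.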

Extending it for multiple search terms

As per comments, this can be extended to run for multiple search terms:

SELECT * {
    VALUES ?searchTerm { "word2vec" "fasttext" "natural language processing" "deep learning" "support vector machine" }
    SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:api "EntitySearch".
        bd:serviceParam wikibase:endpoint "www.wikidata.org".
        bd:serviceParam wikibase:limit 10 .
        bd:serviceParam mwapi:search ?searchTerm.
        bd:serviceParam mwapi:language "en".
        ?item wikibase:apiOutputItem mwapi:item.
        ?num wikibase:apiOrdinal true.
    }
    ?item (wdt:P279|wdt:P31) ?type
}
ORDER BY ?searchTerm ?num


  • The search terms are provided in a VALUES clause and bound to the ?searchTerm variable
  • That variable is then used in the service call
  • The LIMIT 20 now no longer works because it would limit the total number of results instead of just for one term, so I removed it
  • Instead, added wikibase:limit to the service parameters
  • Changed the ordering so that it first orders by search term and then by rank
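If the search terms live in a Python list, as in the follow-up comment, the VALUES line could be generated rather than typed by hand. This is a rough sketch, not a complete client: sparql_literal and values_clause are hypothetical helper names, and the escaping covers only backslashes and double quotes, which is an assumption about what the terms may contain.

```python
def sparql_literal(text):
    """Wrap text in double quotes, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def values_clause(terms, variable="searchTerm"):
    """Build a VALUES clause that binds each term to ?variable in turn."""
    literals = " ".join(sparql_literal(term) for term in terms)
    return "VALUES ?%s { %s }" % (variable, literals)


mylist = ['word2vec', 'fasttext', 'natural language processing',
          'deep learning', 'support vector machine']
clause = values_clause(mylist)
```

The generated clause can be pasted in place of the hand-written VALUES line in the query above, which avoids sending one request per term.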
cygri
  • Wow, thanks a lot for the great explanation. :) Just a quick question: if I want to make `word2vec` a variable, how can I edit the above code? I look forward to hearing from you. Thank you very much once again :) – EmJ Apr 29 '19 at 08:47
  • @Emi It depends. What do you want to achieve by making the search term a variable? – cygri Apr 29 '19 at 08:52
  • I actually have a list of words as follows: mylist = ['word2vec', 'fasttext', 'natural language processing', 'deep learning', 'support vector machine']. Now I want to iterate through the list as `for item in mylist:` and get their Wikidata label. Please let me know if my description is not clear. Looking forward to hearing from you. – EmJ Apr 29 '19 at 08:57
  • @Emi Yes that can be done. I've updated the answer. And one little request for the future: If you have follow-on questions, it's usually best to post them as a new StackOverflow question. That way, others also get a chance to see the question and provide answers. – cygri Apr 29 '19 at 10:31
  • Thanks a lot for the update. I really appreciate it. Thank you once again :) – EmJ Apr 29 '19 at 11:51
3

If you're unsure of the precise spelling or capitalisation, you can use a filter function to perform the match. For example, to match regardless of capitalisation, you could use the LCASE() (or UCASE()) function, as follows:

SELECT ?item WHERE {
  ?item rdfs:label ?label
  FILTER(LCASE(STR(?label)) = "word2vec")
}

This transforms any found label to lowercase and then compares it to the lowercase search string.
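As a local analogue of that filter, here is a small Python sketch of the comparison. The candidate labels are made up for illustration, label_matches is a hypothetical helper, and Python's str.lower() is only assumed to behave like LCASE() for plain ASCII labels such as these.

```python
def label_matches(label, term):
    """Mimic FILTER(LCASE(STR(?label)) = term) for a single label."""
    return label.lower() == term


# Hypothetical labels, not taken from Wikidata.
candidate_labels = ["Word2vec", "word2vec", "WORD2VEC", "Word2vec (software)"]

# A lowercase search term matches every capitalisation of the label.
hits = [label for label in candidate_labels if label_matches(label, "word2vec")]

# A capitalised search term matches nothing, because only the label is lowercased.
no_hits = [label for label in candidate_labels if label_matches(label, "Word2vec")]
```

The second list shows why the string on the right-hand side of the filter has to be lowercase as well.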

There's a whole host of different functions you can use for string manipulation; there's a good overview in the SPARQL 1.1 W3C Recommendation.

NOTE: Doing this kind of query is significantly more expensive (in terms of execution time), because the engine will have to do a sequential scan over all possible matches. Like @AKSW mentioned in the comments, the query as-is is likely to time out when you execute it on the Wikidata public endpoint. It would probably help a lot if you made the query more specific by adding additional triple patterns.

Update: If you have a look at the information available for wd:Q22673982 (you can browse it at https://www.wikidata.org/wiki/Q22673982 ) you'll see that, among other things, it's a subclass of "word embedding" (wd:Q18395344). So what you could do, for example, instead of just asking for every ?item that has an rdfs:label, is ask for all items that are a subclass of wd:Q18395344 and have this label, like this:

SELECT DISTINCT ?item WHERE {
  ?item wdt:P279 wd:Q18395344;
        rdfs:label ?label
  FILTER(LCASE(STR(?label)) = "word2vec")
}

Unfortunately, Wikidata uses rather cryptic identifiers for its properties and relations. Suffice it to say that wdt:P279 corresponds to the "subclass of" relation. I added the DISTINCT because otherwise you get the same answer 10 or more times.

Jeen Broekstra
  • you could additionally mention that those queries will be executed much slower because of a full scan compared to using index like `pos`. At least as long as the triple store doesn't provide some kind of fulltext index or similar extension. For example, your query is likely to timeout on the Wikidata endpoint. Indeed, adding more triple patterns that make the query more specific before the FILTER is applied would also help. I'm sure the TO will ask why the query isn't working right now – UninformedUser Apr 29 '19 at 04:45
  • Good point, hadn't fully considered this was on a public endpoint. I'll update. – Jeen Broekstra Apr 29 '19 at 05:08
  • @JeenBroekstra Thanks a lot for the great answer. As @AKSW has mentioned your first query got timed out. However, I am not sure what you meant by . Does it mean something like `computer science`? – EmJ Apr 29 '19 at 05:35
  • @Emi I meant anything else known about this resource that you're matching. I'll update to clarify. – Jeen Broekstra Apr 29 '19 at 06:12
  • @Emi what Jeen Broekstra is saying is that it's easier for a triple store to search in a more restricted space like persons, books, or whatever. We don't know your use-case, but if you're just trying to make a simple search in Wikidata entities, other services might be more efficient. – UninformedUser Apr 29 '19 at 06:20
  • @Emi for example, you can use the MediaWiki API entity search and you could even use it inside SPARQL, e.g. `SELECT * WHERE { SERVICE wikibase:mwapi { bd:serviceParam wikibase:api "EntitySearch" . bd:serviceParam wikibase:endpoint "www.wikidata.org" . bd:serviceParam mwapi:search "word2vec" . bd:serviceParam mwapi:language "en" . ?item wikibase:apiOutputItem mwapi:item . ?num wikibase:apiOrdinal true . } ?item (wdt:P279|wdt:P31) ?type } ORDER BY ASC(?num) LIMIT 20` – UninformedUser Apr 29 '19 at 06:20
  • @AKSW good stuff - I feel that almost qualifies as a separate answer. I am not really up to speed with the extensions to standard SPARQL that Wikidata offers. Would be good to have both approaches up side-by-side! – Jeen Broekstra Apr 29 '19 at 06:22
  • @AKSW Thanks a lot for the great answer. It would be really great if you could post it as a separate answer. Moreover, it would be a great help if you could explain what happens in the query (since I am new to this area, I still could not understand what happens in the query). Thank you very much :) – EmJ Apr 29 '19 at 06:38
  • @AKSW the query you mentioned is clear to me now. I would like to know why this line is important `?item (wdt:P279|wdt:P31)`. Is there any reason why you selected those two relationships in the query? Looking forward to hearing from you :) – EmJ Apr 29 '19 at 12:07
  • @Emi No, this isn't important. It just shows that the entity search via MediaWiki API doesn't allow for restricting the type of the entities. So you might get back different things like persons, books, places etc. - and this is just shown here by using the triple pattern `?item (wdt:P279|wdt:P31) ?type `. You can check what I mean by looking at the labels: – UninformedUser Apr 29 '19 at 12:45
  • @Emi `SELECT ?item ?itemLabel ?type ?typeLabel WHERE { SERVICE wikibase:mwapi { bd:serviceParam wikibase:api "EntitySearch" . bd:serviceParam wikibase:endpoint "www.wikidata.org" . bd:serviceParam mwapi:search "word2vec" . bd:serviceParam mwapi:language "en" . ?item wikibase:apiOutputItem mwapi:item . ?num wikibase:apiOrdinal true . } ?item (wdt:P279|wdt:P31) ?type SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }} ORDER BY ASC(?num) LIMIT 20` - you can also see that ordering by `?num` does prefer more exact matches in the ranking. – UninformedUser Apr 29 '19 at 12:46
  • @AKSW Just a quick question. What happens if I use the below mentioned code: `SELECT ?term ?item WHERE { SERVICE wikibase:mwapi { bd:serviceParam wikibase:api "EntitySearch" . bd:serviceParam wikibase:endpoint "www.wikidata.org" . bd:serviceParam mwapi:search ?term . bd:serviceParam mwapi:language "en" . bd:serviceParam mwapi:limit 1 . ?item wikibase:apiOutputItem mwapi:item . } ?s ?p ?item .`. What is exactly meant by `?s` and `?p` in the last line? I look forward to hearing from you. Thank you :) – EmJ Apr 29 '19 at 13:30
  • `?s ?p ?item .` is just a triple pattern, basically incoming edges of the items returned by the entity search. That's standard SPARQL. I'm wondering what you want to achieve by this query? You did not specify a search term, why? – UninformedUser Apr 29 '19 at 14:03
  • @AKSW Thanks a lot for your comment. So, if I understand you correctly `?s ?p ?item .` is an RDF triple where `?s` and `?p` denote `subject` and `predicate`, and `?item` denotes the object. This returns the `subject-predicate-object` triples in the Wikidata RDF graph that fulfil our pattern? Please kindly correct me if I am wrong. Looking forward to hearing from you. Thank you :) – EmJ Apr 30 '19 at 00:18
  • @Emi that's basically correct, yes. Nevertheless, you have to define the search term and not use a variable `?term` - this doesn't make sense, you can't call the entity search service without a search term – UninformedUser Apr 30 '19 at 05:51
  • @AKSW thanks a lot for your valuable comments. I learnt a lot from you :) – EmJ Apr 30 '19 at 09:03
  • @AKSW Please let me know your thoughts on this question: https://stackoverflow.com/questions/55920836/how-to-retrieve-the-categorical-details-in-wikidata Thank you very much :) – EmJ Apr 30 '19 at 12:20