We are using spaCy for entity extraction in Python 3 for a non-English language. We get back a list of entities that we want to store in a database. To get a better understanding of the entities, we want to use Wikidata or any other publicly available source to find the correct meaning of a word. Because spaCy is trained on a non-English data set, a lot of words are given the same coarse type (for example, a location), whereas if we could match them with Wikidata we would see that the location found by spaCy is actually a city or a point of interest. That way the data would be much more detailed.

We tried different APIs to find the answer to our question. One of them was https://nl.wikipedia.org/w/api.php?action=query&format=json&prop=pageprops&titles=Bologna . We expected that we could use the Q347019 number in SPARQL to get the ontology (http://dbpedia.org/ontology/City). But we couldn't find a SPARQL query that gave back the results we wanted, and it's not ideal to need multiple requests to get the data we want.
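For reference, here is a minimal Python sketch of that first step: building the `pageprops` request and pulling the Wikidata ID out of the response. It assumes the default `format=json` response shape, where `query.pages` is a dict keyed by page ID and the ID sits under `pageprops.wikibase_item`. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical User-Agent; replace with your own project and contact details.
USER_AGENT = "entity-type-lookup-sketch/0.1 (example@example.org)"


def build_pageprops_url(title, language="nl"):
    """Build the MediaWiki API URL that asks for a page's Wikidata ID."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageprops",
        "ppprop": "wikibase_item",  # only return the Wikidata ID property
        "titles": title,
    }
    base = f"https://{language}.wikipedia.org/w/api.php"
    return base + "?" + urllib.parse.urlencode(params)


def extract_wikidata_ids(response):
    """Map page titles to Wikidata IDs, skipping pages without one."""
    pages = response.get("query", {}).get("pages", {})
    ids = {}
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            ids[page["title"]] = qid
    return ids


def fetch_json(url):
    """Perform the HTTP GET and decode the JSON body (needs network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `extract_wikidata_ids(fetch_json(build_pageprops_url("Bologna")))` would then give a title-to-ID mapping, which still leaves the second request (ID to type) that we would like to avoid.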

We also tried http://lookup.dbpedia.org/api/search.asmx/PrefixSearch?QueryClass=&MaxHits=1&QueryString=Bologna , but this API seems to return different formats across the wide range of entities we query, which makes it difficult to automatically pick out the response we want and store it in the database. It does give back the ontologies we are looking for.

I am looking for an efficient way to query any Wikipedia / Wikidata / DBpedia source (a non-commercial one) to get the ontology URLs (e.g. http://dbpedia.org/ontology/City) of an entity based on a string ("Bologna").

user1923728
  • 2
    The task you're trying to solve is called *entity disambiguation* or *entity linking*, and a plain text lookup would not really work, given that you also have to consider the context. That said, for DBpedia there is already DBpedia Spotlight as a tool. Even better, with spaCy version 3.0 you'll get Wikidata linking for free; see the talk from spaCy IRL this year ([youtube playlist](https://www.youtube.com/playlist?list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc)). – UninformedUser Sep 04 '19 at 03:15
  • 1
    [Slides](https://drive.google.com/file/d/1EuGxcQLcXvjjkZ-KRUlwpr_doBVyEBEG/view) are online as well. Not sure when the major release will be or how other languages will be supported, but at least that will be one way to go, and it should also be possible to train the model on other languages. – UninformedUser Sep 04 '19 at 03:16
  • Ah, yeah, and regarding your string-based lookup via SPARQL: for DBpedia, either use an exact match, `select ?s ?type where {?s rdfs:label "Bologna"@nl. ?s rdf:type ?type .}`, or use `bif:contains` to use the full-text index for text search. – UninformedUser Sep 04 '19 at 03:19
  • 1
    for Wikidata: `SELECT * WHERE { SERVICE wikibase:mwapi { bd:serviceParam wikibase:api "EntitySearch" . bd:serviceParam wikibase:endpoint "www.wikidata.org" . bd:serviceParam mwapi:search "Bologna" . bd:serviceParam mwapi:language "nl" . ?item wikibase:apiOutputItem mwapi:item . ?num wikibase:apiOrdinal true . } ?item (wdt:P279|wdt:P31) ?type } ORDER BY ASC(?num) LIMIT 20` – UninformedUser Sep 04 '19 at 03:21
  • Thank you a lot @AKSW that YouTube playlist is super interesting. I couldn't get the sparql query to work at the dbpedia.org testing tool, but it gives us something to work with. The Wikidata query seems to work and indeed return what we need. I assume that based on the q numbers you can get the ontology url? – user1923728 Sep 04 '19 at 07:27
  • What do you mean by "ontology URL"? The `?type` is the Wikidata entity resp. class. If you need its label, add `SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }` and do `SELECT ?item ?type ?typeLabel` – UninformedUser Sep 04 '19 at 12:35
  • And yes, the DBpedia query isn't working because the label doesn't exist for the `nl` language. Check `select ?s ?type ?l where {?s rdfs:label "Bologna"@en. ?s rdf:type ?type . ?s rdfs:label ?l}` to see all languages loaded into the public endpoint – UninformedUser Sep 04 '19 at 12:36
  • You may also find Virtuoso's Facet Browser interface [over DBpedia](http://dbpedia.org/fct/) to be helpful, [showing here the Types for all entities with "Bologna" in any of their properties](http://dbpedia.org/fct/facet.vsp?qxml=%3C%3Fxml%20version%3D%221.0%22%20encoding%3D%22UTF-8%22%20%3F%3E%3Cquery%20inference%3D%22%22%20invfp%3D%22IFP_OFF%22%20same-as%3D%22SAME_AS_OFF%22%20view3%3D%22%22%20s-term%3D%22%22%20c-term%3D%22%22%20agg%3D%22%22%20limit%3D%2220%22%3E%3Ctext%3EBologna%3C%2Ftext%3E%3Cview%20type%3D%22classes%22%20limit%3D%2220%22%20offset%3D%220%22%20%2F%3E%3C%2Fquery%3E) – TallTed Sep 04 '19 at 19:40
  • Ah yes, you are right! In English I do get a SPARQL result. Thank you for clarifying. I also assumed that SPARQL, Spotlight and Wikidata were all kind of the same data source and just different ways to retrieve the info, but that's incorrect, since you get different results? – user1923728 Sep 05 '19 at 06:10
  • Thanks @AKSW Demo : http://linkedwiki.com/query/Entity_disambiguation_in_Wikidata – Karima Rafes Sep 14 '19 at 13:31
  • @user1923728 Wikidata and DBpedia are different datasources – UninformedUser Sep 14 '19 at 14:06
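
For anyone wanting to run the Wikidata query from the comments out of Python, the following is a minimal sketch rather than a definitive implementation. It assumes the public endpoint at `https://query.wikidata.org/sparql` and the standard SPARQL JSON results format. The function names and the User-Agent string are made up for illustration, and the query text is the one from the comment above with the search term and language filled in.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical User-Agent; replace with your own project and contact details.
USER_AGENT = "entity-type-lookup-sketch/0.1 (example@example.org)"


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_wikidata_query(term, language="nl", limit=20):
    """Fill the search term and language into the query from the comments."""
    return (
        "SELECT ?item ?type WHERE { "
        "SERVICE wikibase:mwapi { "
        'bd:serviceParam wikibase:api "EntitySearch" . '
        'bd:serviceParam wikibase:endpoint "www.wikidata.org" . '
        'bd:serviceParam mwapi:search "' + escape_literal(term) + '" . '
        'bd:serviceParam mwapi:language "' + escape_literal(language) + '" . '
        "?item wikibase:apiOutputItem mwapi:item . "
        "?num wikibase:apiOrdinal true . } "
        "?item (wdt:P279|wdt:P31) ?type } "
        "ORDER BY ASC(?num) LIMIT " + str(int(limit))
    )


def parse_types(result):
    """Group type URIs by item URI from a SPARQL JSON result document."""
    types = {}
    for row in result["results"]["bindings"]:
        item = row["item"]["value"]
        types.setdefault(item, []).append(row["type"]["value"])
    return types


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query with HTTP GET and decode the JSON body (needs network)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": USER_AGENT,
    })
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A call such as `parse_types(run_query(build_wikidata_query("Bologna")))` would return a dict from candidate item URIs to their `P31`/`P279` class URIs. Because this is a plain string search, several candidates can come back for one term, so picking the right one still needs the context, as noted in the first comment.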

0 Answers