I want to reconcile a large number of records for which I have the exact Wikipedia article titles (including parenthetical disambiguation). What is the best/fastest way to match these records by their exact Wikipedia title in OpenRefine? If I simply reconcile by text, the confidence is low and Wikidata entries with the same title get mixed up.
2 Answers
Transform your values into Wikipedia URLs, for instance with the following GREL formula (assuming all articles are on the English Wikipedia):
'https://en.wikipedia.org/wiki/'+value
You can then reconcile this column with the Wikidata reconciliation service, which will recognize these URLs and resolve the Wikidata items via site links.
If some of your article titles point to disambiguation pages, reconciliation will give you disambiguation items, so it is good practice to double-check their type (P31) by fetching it after reconciliation.
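If you would rather prepare the URL column outside OpenRefine before importing, the same transformation can be sketched in Python. This is only an illustration, not part of the reconciliation service: the helper name `title_to_url`, the underscore normalization, and the choice of characters left unescaped are my own assumptions.

```python
from urllib.parse import quote

# Assumption: all articles are on the English Wikipedia.
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

def title_to_url(title: str) -> str:
    """Build an English Wikipedia URL from an exact article title (hypothetical helper)."""
    # Wikipedia page URLs use underscores where titles have spaces.
    normalized = title.strip().replace(" ", "_")
    # Percent-encode the rest, leaving parentheses, commas and colons readable.
    return WIKI_PREFIX + quote(normalized, safe="_(),:")

urls = [title_to_url(t) for t in ["Mercury (planet)", "Café"]]
```

The resulting column can then be reconciled the same way as the GREL-generated one.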

This was exactly what I needed. Thanks pintoch! With `'https://en.wikipedia.org/wiki/'+escape(value, 'url')` I was able to reconcile every single article! – CennoxX May 07 '20 at 14:07
I think you are approaching this from the opposite direction. Use Wikidata item IDs (Q-numbers), which are also available for disambiguation pages. The link to the Wikidata item is in the left sidebar of the Wikipedia page. These IDs disambiguate, are language-neutral, and are queryable. Almost every Wikipedia article has a Wikidata item.
There might also be a SPARQL query that would do this work for you. If you ask some of the Wikidatans they can help. Try @wikidatafacts on Twitter.
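As a rough idea of what such a query could look like, here is a sketch that builds a SPARQL string looking up items by their English Wikipedia sitelink title. It assumes the `schema:about` / `schema:isPartOf` / `schema:name` sitelink pattern and the predefined `schema:` prefix of the Wikidata Query Service; the function name `build_sitelink_query` is hypothetical, and the snippet only builds the query text without sending it anywhere.

```python
def build_sitelink_query(titles):
    """Return a SPARQL query mapping exact enwiki titles to Wikidata items (sketch)."""
    def literal(title):
        # Escape backslashes and double quotes for a SPARQL string literal.
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        # Sitelink titles on the English Wikipedia carry the @en language tag.
        return f'"{escaped}"@en'

    values = " ".join(literal(t) for t in titles)
    return (
        "SELECT ?title ?item WHERE {\n"
        f"  VALUES ?title {{ {values} }}\n"
        "  ?article schema:about ?item ;\n"
        "           schema:isPartOf <https://en.wikipedia.org/> ;\n"
        "           schema:name ?title .\n"
        "}"
    )

query = build_sitelink_query(["Mercury (planet)", "Mercury (element)"])
```

Titles go in with spaces rather than underscores. For a very long list, the titles would probably need to be sent in batches.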
If you also need the non-linked text that appears in some disambiguation page lists, the manually maintained nature of Wikipedia won’t help you, but you could spot-check for those outliers.

As I said, I have a large number of Wikipedia articles (from a full-text search). Getting the Wikidata ID manually from each Wikipedia page would be rather inefficient for that. A SPARQL request might synchronize the data, but then I would have to enter the data back into OpenRefine (my records contain more than just the Wikipedia article title column), which would be another time-consuming step... – CennoxX May 07 '20 at 13:48