The Wikipedia iwlinks table only stores some links to Wikidata pages. Where are the others?

Question

I'm using the Wikipedia dumps extracts to process Wikipedia instead of the Wikipedia API because I'd like to run a lot of queries quickly.

I'd like to connect Wikipedia pages to their respective Wikidata pages. My understanding is the iwlinks table contains this information. However, although I've been able to verify this for some Wikipedia pages, I've also been able to verify that it's not the case for others.

For example, if we look up Metallica's Wikipedia page in the iwlinks table, we get:

iwl_from, iwl_prefix, iwl_title
'18787', 'c', 'Special:Search/Metallica'
'18787', 'd', 'Q15920'
'18787', 'q', 'Special:Search/Metallica'

Here, the row with 'd' in the iwl_prefix column tells us where to find Metallica's Wikidata page (i.e. Q15920).

However, if we look up Tom Selleck's Wikipedia page in the iwlinks table using:

SELECT * FROM iwlinks WHERE iwl_from = 277451;

we get:

iwl_from, iwl_prefix, iwl_title
'277451', 'commons', 'Tom_Selleck'
'277451', 'q', 'Special:Search/Tom_Selleck'

Neither of these rows contains information about his Wikidata page. However, his Wikipedia page has a "Wikidata item" link to his Wikidata page, so presumably it must be stored somewhere, but I can't find it.

I'd greatly appreciate any suggestions you can think of.

P.S. Bonus points if you can point me in the right direction to figure out where the licence information is stored for each image in Wikipedia.

score 3 · Answer 1 · answered Apr 13 '20 at 10:13

I'm not sure your approach is promising at all. First, here is what I understand you want to achieve:

Given a page on Wikipedia
You want to get the corresponding Wikidata item
You probably also want to get the other pages connected to this item?

If that's correct, I think your best bet would be the wb_items_per_site table of Wikidata. Why Wikidata and not Wikipedia? The current architecture of Wikibase (the software behind Wikidata) requires access to both the client (aka Wikipedia) database and the repo (aka Wikidata) database: the information about the page is stored in the client database, whereas the information about the connected item (including the fact that the page is connected to an item at all) is stored in the repo database. That information lives in the wb_items_per_site table (at least that's the one I would use; I'm not a Wikibase developer, so this might not be the best solution either).

E.g., if you want to get the Wikidata item for the Wikipedia article, I would issue the following query:

select * from wb_items_per_site where ips_site_id = "enwiki" and ips_site_page = "Tom Selleck" limit 1;

(Note that you need to replace underscores (_) with spaces. This is logic that the Title class in MediaWiki would normally handle.) The output would look like:

ips_row_id     ips_item_id     ips_site_id     ips_site_page
540761088      213706          enwiki          Tom Selleck

(reference https://quarry.wmflabs.org/query/43884)
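If you are scripting this lookup, the two small conversions involved can be sketched in Python as below. The function names are made up for illustration; the only facts relied on are the ones above (underscores become spaces in ips_site_page) plus the assumption that the numeric ips_item_id corresponds to the item ID with a "Q" prefix.

```python
def title_to_site_page(db_title: str) -> str:
    """Convert a dump-style title ("Tom_Selleck") to the spaced form
    that is stored in ips_site_page ("Tom Selleck")."""
    return db_title.replace("_", " ")


def item_id_to_qid(ips_item_id: int) -> str:
    """Turn the numeric ips_item_id into a Wikidata item ID.
    Assumption: items are addressed as "Q" followed by this number."""
    return f"Q{ips_item_id}"


if __name__ == "__main__":
    print(title_to_site_page("Tom_Selleck"))
    print(item_id_to_qid(213706))
```

This only mirrors the simplest part of MediaWiki's title handling; the real Title class does more normalization than a plain character replacement.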

To get the other connected pages from this Wikidata item, you can issue a second query:

select * from wb_items_per_site where ips_item_id = 213706;

(see the output here; it's too big to be pasted here :P)

For your "bonus question":

The license information is stored in the wikitext, unfortunately. This means that, for images hosted on en.wikipedia.org, you need to parse the wikitext to get the related license information.

One good point here is: most images are not hosted on en.wikipedia.org, but rather in the Wikimedia Commons project. Over there, there is a project, called structured image data or so, which aims to make such information (license, title, authors and so on) available in a structured, machine-readable way. Unfortunately, nowhere near all of the images and media stored there have this information in structured form yet. So the fallback would always be to parse the wikitext.

Wikipedia has an extension installed which partly takes over this parsing part for you. This information is, e.g., used in the MediaViewer feature. The information is available through the api: https://en.wikipedia.org/w/api.php?action=query&titles=File:Albert%20Einstein%20Head.jpg&prop=imageinfo&iiprop=extmetadata

There you get the License:

"License": {
    "value": "pd",
    "source": "commons-templates",
    "hidden": ""
}

and the license short name:

"LicenseShortName": {
    "value": "Public domain",
    "source": "commons-desc-page",
    "hidden": ""
}
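As a rough sketch, the two fields above can be pulled out of the API response with a few lines of Python. The function names and the User-Agent string are made up for illustration, and the added format=json parameter and the query → pages → imageinfo → extmetadata nesting are assumptions about the default JSON output rather than something shown in the URL above.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extmetadata_url(file_title: str) -> str:
    """Build the same query as the URL above, with format=json added (assumption)."""
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_license(response: dict) -> dict:
    """Collect License and LicenseShortName values from a parsed response."""
    result = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            meta = info.get("extmetadata", {})
            for key in ("License", "LicenseShortName"):
                if key in meta:
                    result[key] = meta[key].get("value")
    return result


def fetch_license(file_title: str) -> dict:
    """Call the API (network access required) and return the license fields."""
    request = urllib.request.Request(
        build_extmetadata_url(file_title),
        headers={"User-Agent": "license-lookup-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as resp:
        return extract_license(json.load(resp))
```

Calling fetch_license("File:Albert Einstein Head.jpg") should then return a dict containing the two values shown above, as long as the API still reports them for that file.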

Unfortunately for you, as I assume you would like to get this info from the dumps, this information is not available there. The information is parsed "on-the-fly" on an API request by the API from the wikitext.

Parsing license information is not that hard, the markup patterns to look for are [documented](https://commons.wikimedia.org/wiki/Commons:Machine-readable_data). You need to parse the HTML though, not the wikitext, and there aren't any HTML dumps. — Tgr, Apr 14 '20 at 08:54
@Florian Thanks heaps for all your suggestions. That helped me understand the underlying data structures. I appreciate you taking the time to explain that. — Brad Wilcox, Apr 14 '20 at 09:04

score 2 · Accepted Answer · answered Apr 14 '20 at 08:59

You can find the Wikidata item in the page_props table. iwlinks contains only the links that appear in the article text (look at the bottom of the Metallica article and you'll see a little sister project box, which is just a wikitext template; that's what generated those iwlinks entries). The links in the sidebar used to come from langlinks, but Wikidata has largely replaced the system of interlanguage links, so those associations are now stored on Wikidata instead.
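A minimal sketch of that page_props lookup, using an in-memory SQLite table as a stand-in for the imported dump. The property name 'wikibase_item' and the column names are assumptions not stated in this answer, so check them against your dump; the two demo rows reuse the page IDs and item IDs mentioned elsewhere in this thread.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in for page_props (the real table comes from the dump)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO page_props VALUES (?, ?, ?)",
        [
            (18787, "wikibase_item", "Q15920"),    # Metallica
            (277451, "wikibase_item", "Q213706"),  # Tom Selleck
        ],
    )
    return conn


def wikidata_item_for_page(conn: sqlite3.Connection, page_id: int):
    """Return the Wikidata item ID for a page ID, or None if no row exists.
    Assumption: the item is stored under pp_propname = 'wikibase_item'."""
    row = conn.execute(
        "SELECT pp_value FROM page_props "
        "WHERE pp_page = ? AND pp_propname = 'wikibase_item'",
        (page_id,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    print(wikidata_item_for_page(build_demo_db(), 277451))
```

The same SELECT should work unchanged against a MySQL import of the page_props dump, apart from the placeholder syntax of whichever driver you use.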

Thank you!! That solves it. I appreciate your help. — Brad Wilcox, Apr 14 '20 at 09:05
