I'm not sure your approach is promising at all. First of all, here is what I understood you want to achieve:
- You have a page on Wikipedia
- You want to get the corresponding Wikidata item
- You probably also want to get the other pages connected to this item
If that's correct, I think your best bet is the wb_items_per_site table of Wikidata. Why Wikidata and not Wikipedia? The current architecture of Wikibase (the software behind Wikidata) involves two databases, the client (here Wikipedia) and the repo (here Wikidata): information about the page itself is stored in the client database, whereas information about the connected item (including the fact that the page is connected to an item at all) is stored in the repo database, in the wb_items_per_site table. At least that's the table I would use; I'm not a Wikibase developer, so this might not be the best solution either.
For example, to get the Wikidata item for the English Wikipedia article "Tom Selleck", I would issue the following query:
select * from wb_items_per_site where ips_site_id = 'enwiki' and ips_site_page = 'Tom Selleck' limit 1;
Note that you need to replace underscores (_) in the page title with spaces; in MediaWiki this normalization is done by the Title class. The output looks like:
ips_row_id  ips_item_id  ips_site_id  ips_site_page
540761088   213706       enwiki       Tom Selleck
(reference https://quarry.wmflabs.org/query/43884)
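If the title comes from a URL, the underscore replacement mentioned above could be sketched like this. It is only a rough approximation and not the real Title class logic; the function name to_site_page is made up for illustration, and things like namespace handling or first-letter capitalization are ignored.

```python
from urllib.parse import unquote


def to_site_page(title: str) -> str:
    """Approximate the ips_site_page form of a title taken from a URL.

    Hypothetical helper: a simplified stand-in for MediaWiki's Title class.
    """
    # Decode percent-escapes such as %20.
    decoded = unquote(title)
    # The table stores titles with spaces instead of underscores.
    spaced = decoded.replace("_", " ")
    # Collapse repeated whitespace and trim both ends.
    return " ".join(spaced.split())


print(to_site_page("Tom_Selleck"))
```

The result can then be used as the ips_site_page value in the query.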
To get the other connected pages from this Wikidata item, you can issue a second query:
select * from wb_items_per_site where ips_item_id = 213706;
(The full output is too big to be pasted here :P)
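The two queries can be chained in a small script. The sketch below uses an in-memory SQLite table as a stand-in for the real wb_items_per_site table, so it runs without access to a database replica. The function name connected_pages and the second sample row are made up for illustration; against the real MariaDB database you would use a different driver, and probably a different placeholder style.

```python
import sqlite3


def connected_pages(conn, site_id, page):
    """Return (item_id, [(site_id, page), ...]) for a page, or None if unlinked."""
    # Step 1: find the item the page is connected to.
    row = conn.execute(
        "SELECT ips_item_id FROM wb_items_per_site "
        "WHERE ips_site_id = ? AND ips_site_page = ? LIMIT 1",
        (site_id, page),
    ).fetchone()
    if row is None:
        return None
    item_id = row[0]
    # Step 2: fetch every page connected to that item.
    links = conn.execute(
        "SELECT ips_site_id, ips_site_page FROM wb_items_per_site "
        "WHERE ips_item_id = ? ORDER BY ips_site_id",
        (item_id,),
    ).fetchall()
    return item_id, links


# In-memory stand-in for the real table (assumption: same column names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wb_items_per_site ("
    "ips_row_id INTEGER PRIMARY KEY, ips_item_id INTEGER, "
    "ips_site_id TEXT, ips_site_page TEXT)"
)
conn.executemany(
    "INSERT INTO wb_items_per_site VALUES (?, ?, ?, ?)",
    [
        (540761088, 213706, "enwiki", "Tom Selleck"),  # row from the first query
        (1, 213706, "xxwiki", "Example page"),  # made-up placeholder row
    ],
)

result = connected_pages(conn, "enwiki", "Tom Selleck")
print(result)
```

Passing the values as parameters instead of pasting them into the SQL string also avoids quoting problems with titles that contain apostrophes.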
For your "bonus question":
The license information is stored in the wikitext, unfortunately. This means that for images hosted on en.wikipedia.org you need to parse the wikitext to get the related license information.
One good point here: most images are not hosted on en.wikipedia.org but on Wikimedia Commons. There is a project over there, Structured Data on Commons, which aims to make such information (license, title, authors and so on) available in a structured, machine-readable way. Unfortunately, nowhere near all of the images and media files stored there have this information in structured form yet, so the fallback is always to parse the wikitext.
Wikipedia has an extension installed (CommonsMetadata) that does part of this parsing for you. Its output is used, for example, by the MediaViewer feature, and it is available through the API:
https://en.wikipedia.org/w/api.php?action=query&titles=File:Albert%20Einstein%20Head.jpg&prop=imageinfo&iiprop=extmetadata
There you get the License field:
"License": {
    "value": "pd",
    "source": "commons-templates",
    "hidden": ""
}
and the license short name:
"LicenseShortName": {
    "value": "Public domain",
    "source": "commons-desc-page",
    "hidden": ""
}
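To pull these two fields out of a response programmatically, something like the following might work. It is a sketch under a few assumptions: format=json is added to the request, the response has the default JSON layout (pages keyed by page ID), and the function names extmetadata_url and license_info are made up. The sample response is trimmed down to the two fields shown above, with a placeholder page ID, so the code runs without network access.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def extmetadata_url(file_title):
    """Build the imageinfo/extmetadata request URL for one file title."""
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "format": "json",  # assumption: ask for JSON output explicitly
    }
    return API_URL + "?" + urlencode(params)


def license_info(response):
    """Return {title: (License, LicenseShortName)} from a decoded response."""
    result = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            meta = info.get("extmetadata", {})
            # Either field may be missing, in which case None is stored.
            code = meta.get("License", {}).get("value")
            name = meta.get("LicenseShortName", {}).get("value")
            result[page.get("title")] = (code, name)
    return result


# Trimmed sample response; the page ID key "-1" is a placeholder.
sample = {
    "query": {
        "pages": {
            "-1": {
                "title": "File:Albert Einstein Head.jpg",
                "imageinfo": [
                    {
                        "extmetadata": {
                            "License": {"value": "pd"},
                            "LicenseShortName": {"value": "Public domain"},
                        }
                    }
                ],
            }
        }
    }
}

url = extmetadata_url("File:Albert Einstein Head.jpg")
licenses = license_info(sample)
print(licenses)
```

In real use you would fetch the URL with an HTTP client of your choice and pass the decoded JSON to license_info.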
Unfortunately for you, since I assume you would like to get this information from the dumps: it is not available there. The API derives it from the wikitext on the fly for each request.