Hi everyone, I'm looking to extract the value of "wikibase_item" for every article in Wikipedia using the Wikipedia dump (a bz2 file that I have already downloaded). Here is an example of the value I want to get ("Q2263"):

{"batchcomplete":"","query":{"pages":{"43568":{"pageid":43568,"ns":0,"title":"Tom Hanks","pageprops":{"defaultsort":"Hanks, Tom","page_image_free":"Tom_Hanks_TIFF_2019.jpg","wikibase-shortdesc":"American actor and film producer","wikibase_item":"Q2263"}}}}}

That example came from a query to the API, which I don't want to use.

I opened the XML file inside the bz2 archive and searched (Ctrl-F) for "wikibase_item", and also for the value of a specific entity, but found nothing. Is there any way to get this value from the wiki dump at all? If there are other ways to get it, I would like to hear about them.

Note: my code is taken from this GitHub repository: https://github.com/jeffheaton/present/tree/master/youtube/wikipedia/process. That code provides the "id" of each article, which is not the same across languages. That's why I want to get the "wikibase_item" value instead.

Any comments will be appreciated. Thanks!

  • It is a normal dictionary, so you can use recursion to check every element in it. Or you can convert it to a string and search for `"wikibase_item":` using the normal `find()` (and then search for the closing `"`), or you can use `regex` – furas Oct 20 '22 at 19:35
  • @furas I think you are talking about the API result when you say dictionary. I am talking about the XML file in the Wikipedia dump, and there is no `find()` in there. – assaf bitton Oct 21 '22 at 10:27
  • Python has many modules for working with XML/HTML, e.g. the popular external modules `lxml` and `BeautifulSoup`, and standard modules (see the [documentation](https://docs.python.org/3/library/markup.html)), e.g. `xml.etree` – furas Oct 21 '22 at 10:52
  • @furas OK, but my question is not about how to read an XML file. I am looking for an explanation of how to get the value of the wikibase item from the wiki dump. – assaf bitton Oct 21 '22 at 18:23
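
For reference, the recursive lookup suggested in the first comment could look roughly like the sketch below. It applies only to the API response shown in the question (a nested dictionary once parsed), not to the XML dump, so it does not by itself solve the dump problem. The helper name `find_key` is hypothetical, and the response string is copied from the question.

```python
import json

# API response copied from the question (not taken from the dump).
RESPONSE = (
    '{"batchcomplete":"","query":{"pages":{"43568":{"pageid":43568,"ns":0,'
    '"title":"Tom Hanks","pageprops":{"defaultsort":"Hanks, Tom",'
    '"page_image_free":"Tom_Hanks_TIFF_2019.jpg",'
    '"wikibase-shortdesc":"American actor and film producer",'
    '"wikibase_item":"Q2263"}}}}}'
)


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists.

    Hypothetical helper for illustration; not part of any library.
    """
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                # Descend into the value to look for deeper matches.
                yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


data = json.loads(RESPONSE)
items = list(find_key(data, "wikibase_item"))
print(items)  # ['Q2263']
```

Because `find_key` is a generator, it also works unchanged when a response contains several pages, yielding one value per page that has the property.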

0 Answers