I am parsing a Wikipedia dump in Java. In my module I want to know the page IDs of the internal wiki pages that the current page links to. Getting the internal links, and thus their URLs, is easy. But how do I get the page ID from a URL?

Do I have to use something from MediaWiki for this? If so, how? Is there any alternative?

For example, for http://en.wikipedia.org/wiki/United_States I want to get its page ID, i.e. 3434750.

MrTambourineMan
  • Where is the page id specified in the page? – christopher Mar 20 '14 at 18:02
  • If Wikipedia doesn't provide an API for you to retrieve this info, looks like you will need to build some automation into your "crawler" to go into each page and retrieve the ID you want (You can try Selenium/HTMLUnitDriver). – the_marcelo_r Mar 20 '14 at 18:04
  • Start out with the [Wikipedia API](http://en.wikipedia.org/w/api.php). From the page source, it appears this ID is `wgArticleId` in `mw.config.set`, but I am unsure of how to pull that from the API. – admdrew Mar 20 '14 at 18:08
  • I guess I will have to do some parsing to retrieve it – MrTambourineMan Mar 20 '14 at 19:44

2 Answers

You can use the API for that. Specifically, the query would look something like:

http://en.wikipedia.org/w/api.php?action=query&titles=United_States

(You can also specify more than one page title in the titles parameter, separated by |.)
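A minimal Java sketch of that query, under a few assumptions: the class and method names are made up for illustration, `format=json` is added so the response is machine-readable, and the page IDs are pulled out with a simple regex rather than a real JSON parser (a JSON library would be more robust). Fetching the URL itself is left out; any HTTP client would do. `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class PageIdLookup {
    // Matches "pageid":12345 in the JSON response.
    private static final Pattern PAGE_ID = Pattern.compile("\"pageid\"\\s*:\\s*(\\d+)");

    // Builds the API URL for one or more titles, joined with '|'.
    static String buildQueryUrl(List<String> titles) {
        String joined = String.join("|", titles);
        return "https://en.wikipedia.org/w/api.php?action=query&format=json&titles="
                + URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    // Extracts every page ID found in a response body.
    // Titles that do not exist have no "pageid" field and are skipped.
    static List<Long> extractPageIds(String json) {
        List<Long> ids = new ArrayList<>();
        Matcher m = PAGE_ID.matcher(json);
        while (m.find()) {
            ids.add(Long.parseLong(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample body in the shape the API returns for a single existing page.
        String sample = "{\"query\":{\"pages\":{\"3434750\":"
                + "{\"pageid\":3434750,\"ns\":0,\"title\":\"United States\"}}}}";
        System.out.println(buildQueryUrl(List.of("United_States")));
        System.out.println(extractPageIds(sample));
    }
}
```

Batching several titles into one request this way keeps the number of API calls down when a page has many internal links.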

As an alternative, you could download the page.sql dump (1 GB compressed for the English Wikipedia), which also contains this information. To actually query it, you could either import it into a MySQL database and then query that, or you could directly parse the SQL.
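Directly parsing the SQL could look roughly like the sketch below. It assumes each row tuple in an `INSERT INTO `page` VALUES ...` line starts with `page_id`, `page_namespace` and `page_title` in that order, and that titles are single-quoted with backslash escapes; check this against the `CREATE TABLE` statement at the top of your dump. The class and method names are hypothetical, and a regex like this is a shortcut, not a full SQL parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for scanning page.sql without a database.
class PageSqlParser {
    // Start of a row tuple: (page_id,page_namespace,'page_title'
    // The title part allows backslash-escaped characters such as \'.
    private static final Pattern ROW_START =
            Pattern.compile("\\((\\d+),(-?\\d+),'((?:[^'\\\\]|\\\\.)*)'");

    // Returns title -> page ID for all rows of one INSERT line
    // that belong to the given namespace (0 = articles).
    static Map<String, Long> parseInsertLine(String line, int namespace) {
        Map<String, Long> result = new LinkedHashMap<>();
        if (!line.startsWith("INSERT INTO")) {
            return result; // schema, comments, etc.
        }
        Matcher m = ROW_START.matcher(line);
        while (m.find()) {
            if (Integer.parseInt(m.group(2)) != namespace) {
                continue;
            }
            // Undo backslash escaping, e.g. \' becomes '.
            String title = m.group(3).replaceAll("\\\\(.)", "$1");
            result.put(title, Long.parseLong(m.group(1)));
        }
        return result;
    }
}
```

Titles in the dump use underscores instead of spaces, the same form as in the URL, so `United_States` can be looked up in the resulting map directly. The dump can be read line by line, for example with a `BufferedReader` over a `GZIPInputStream`, passing each line to `parseInsertLine`.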

svick

If you can't use the API, you can always get the page ID from the info page, reached by appending ?action=info to the URL. It should make a better starting point for a parser.

For your example above: https://en.wikipedia.org/wiki/United_States?action=info

Lokal_Profil