Wikipedia Page Id from URL

Question

I am parsing a Wikipedia dump in Java. In my module, I want to know the page IDs of the internal wiki pages that the current page links to. Getting the internal links, and thus their URLs, is easy. But how do I get the page ID from a URL?

Do I have to use something from MediaWiki for this? If so, how? Is there any other alternative?

For example, for http://en.wikipedia.org/wiki/United_States I want to get its page ID, i.e. 3434750.

If Wikipedia doesn't provide an API for you to retrieve this info, looks like you will need to build some automation into your "crawler" to go into each page and retrieve the ID you want (You can try Selenium/HTMLUnitDriver). — the_marcelo_r, Mar 20 '14 at 18:04
Start out with the [Wikipedia API](http://en.wikipedia.org/w/api.php). From the page source, it appears this ID is `wgArticleId` in `mw.config.set`, but I am unsure of how to pull that from the API. — admdrew, Mar 20 '14 at 18:08

score 7 · Accepted Answer · answered Mar 20 '14 at 23:17

You can use the API for that. Specifically, the query would look something like:

http://en.wikipedia.org/w/api.php?action=query&titles=United_States

(You can also specify more than one page title in the `titles` parameter, separated by `|`.)
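Since the question is about Java, here is a minimal sketch of how that query could be issued and read. It assumes `format=json` is added to the query and that each page object in the response lists `pageid`, `ns`, and `title` in that order. The class name `PageIdLookup`, its methods, and the User-Agent string are made up for illustration, and the regex is only a stand-in for a real JSON parser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageIdLookup {
    // Assumption: page objects look like {"pageid":N,"ns":N,"title":"..."}.
    private static final Pattern PAGE = Pattern.compile(
            "\"pageid\"\\s*:\\s*(\\d+)\\s*,\\s*\"ns\"\\s*:\\s*-?\\d+\\s*,"
            + "\\s*\"title\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Builds the query URL; several titles are joined with '|' and URL-encoded.
    static String buildQueryUrl(String... titles) {
        String joined = String.join("|", titles);
        return "https://en.wikipedia.org/w/api.php?action=query&format=json&titles="
                + URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    // Fetches the response body (requires Java 11+). Not called by main below.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Hypothetical User-Agent; replace with your own contact details.
                .header("User-Agent", "PageIdLookupExample/0.1 (example only)")
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Maps each returned title to its page ID. Missing pages carry no
    // "pageid" field, so they are simply skipped.
    static Map<String, Long> extractPageIds(String json) {
        Map<String, Long> ids = new LinkedHashMap<>();
        Matcher m = PAGE.matcher(json);
        while (m.find()) {
            ids.put(m.group(2), Long.parseLong(m.group(1)));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(buildQueryUrl("United_States"));
        // Sample response in the assumed shape; use fetch(url) for a live call.
        String sample = "{\"query\":{\"pages\":{\"3434750\":"
                + "{\"pageid\":3434750,\"ns\":0,\"title\":\"United States\"}}}}";
        System.out.println(extractPageIds(sample));
    }
}
```

Note that the API returns the normalized title ("United States", with a space), so the map keys may differ from the underscore form used in the URL.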

As an alternative, you could download the page.sql dump (1 GB compressed for the English Wikipedia), which also contains this information. To actually query it, you could either import it into a MySQL database and query that, or parse the SQL directly.
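For the "parse the SQL directly" route, a rough sketch might look like the following. It assumes the dump has been decompressed, that rows are packed into long `INSERT INTO` lines, and that each row starts with `page_id`, `page_namespace`, `page_title` in that order. `PageSqlScanner` and `collectPageIds` are hypothetical names, the sample line is shortened for illustration, and the regex is a shortcut that could in principle misfire on unusual titles, so treat it as a starting point rather than a robust SQL parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageSqlScanner {
    // Matches the start of a row tuple: (page_id,page_namespace,'page_title',
    // The title part allows backslash-escaped characters such as \' inside it.
    private static final Pattern ROW = Pattern.compile(
            "\\((\\d+),(-?\\d+),'((?:[^'\\\\]|\\\\.)*)',");

    // Adds title -> page ID for every main-namespace row found on one dump line.
    static void collectPageIds(String line, Map<String, Long> out) {
        if (!line.startsWith("INSERT INTO")) {
            return; // skip schema definitions, comments, and lock statements
        }
        Matcher m = ROW.matcher(line);
        while (m.find()) {
            if (!"0".equals(m.group(2))) {
                continue; // keep articles only (namespace 0)
            }
            // Undo one level of backslash escaping, e.g. \' becomes '
            String title = m.group(3).replaceAll("\\\\(.)", "$1");
            out.put(title, Long.parseLong(m.group(1)));
        }
    }

    public static void main(String[] args) {
        // Shortened sample line; real rows carry more columns after the title.
        String line = "INSERT INTO `page` VALUES "
                + "(10,0,'AccessibleComputing','',1),(12,0,'Anarchism','',0);";
        Map<String, Long> ids = new LinkedHashMap<>();
        collectPageIds(line, ids);
        System.out.println(ids);
    }
}
```

Titles in the dump use underscores, so the last path segment of a link such as `United_States` (after URL-decoding) can be used as the lookup key. In practice you would call `collectPageIds` on each line of the dump file in turn.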

score 1 · Answer 2 · answered Mar 27 '14 at 17:34

If you can't use the API, you can always get the page ID from the info page, reached by appending `?action=info` to the URL. That should make a better starting point for a parser.

For your example above: https://en.wikipedia.org/wiki/United_States?action=info

Lokal_Profil