I retrieve the list of pages for a given category using the Wikipedia API. However, the pages are represented by their page_id. How can I get a page's actual textual content by its page_id using the Wikipedia API?
2 Answers
AFAIK there is no direct way of getting the text of a wiki page from the pageid. However, there are a couple of workarounds.
Getting the URL and then parsing
Get the URL of the wiki page by making an API call like
http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=<your_pageid_here>&inprop=url
Then fetch that URL and parse the text out of the HTML.
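The first step above can be sketched in Python using only the standard library. The endpoint and parameters are the ones shown in the answer; the pageid `21721040` is a hypothetical example, and the actual HTTP request is left to the reader:

```python
# Sketch: build the "info" API request whose JSON response contains the
# page's full URL. The pageid used here is a hypothetical example.
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def info_url(pageid):
    """Return the API URL whose JSON response carries the page's 'fullurl'."""
    params = {
        "action": "query",
        "prop": "info",
        "pageids": pageid,
        "inprop": "url",
        "format": "json",
    }
    return API + "?" + urlencode(params)

# After fetching this URL, the page URL sits at:
#   data["query"]["pages"][str(pageid)]["fullurl"]
```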
Getting the pagename and then the content
The Wikipedia API allows extraction of text if the pagename is known. But as you only know the pageid for now, you will need to convert the pageid into a pagename with an API call like
http://en.wikipedia.org/w/api.php?action=query&pageids=<your_pageid_here>&format=json
This will give you the pagename; then you can make another API call to get the contents:
http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=<your_pagename_here>&format=json
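The two-step lookup above can be sketched as two URL builders, again using only the standard library. The pageid `13673345` and page name `Pet_door` are hypothetical examples; fetching and JSON decoding are left out:

```python
# Sketch of the two-step lookup: pageid -> pagename, then pagename -> HTML.
# Endpoints and parameters come from the answer; example values are hypothetical.
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def title_lookup_url(pageid):
    # Step 1: the JSON response contains the pagename at
    #   data["query"]["pages"][str(pageid)]["title"]
    return API + "?" + urlencode(
        {"action": "query", "pageids": pageid, "format": "json"}
    )

def parse_text_url(pagename):
    # Step 2: the JSON response contains the page body (as HTML) at
    #   data["parse"]["text"]["*"]
    return API + "?" + urlencode(
        {"action": "parse", "prop": "text", "page": pagename, "format": "json"}
    )
```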

It looks like if I use the content from the API you mentioned, it returns all the HTML tags as well. Is there any way to use the export feature to get only text? https://en.wikipedia.org/wiki/Special:Export – HHH Jul 16 '15 at 20:51
According to en.wikipedia.org/wiki/Special:Export it seems that it is used for MediaWiki migrations and exports in the form of an XML. There's no way you can get raw text from the API. You can parse HTML to plaintext pretty easily with HTML parsers; I personally like to use [Jsoup](http://jsoup.org/) – Shreyas Chavan Jul 16 '15 at 21:02
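The HTML-to-plaintext step mentioned in the comment (Jsoup is a Java library) can be sketched in Python with the standard library's `html.parser`; this is a minimal analogue, not a full-featured extractor:

```python
# Sketch: strip HTML tags from the parse-API output using only the
# standard library, keeping just the text nodes.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

# html_to_text('<p>Hello <b>world</b></p>')  ->  'Hello world'
```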
You can do that by adding a hyperlink like this; the pageid is the one you get from the API:
href=http://en.wikipedia.org/?curid=${pageid}
So the final link will look like https://en.wikipedia.org/?curid=13673345
