4

I retrieve the list of pages for a given category using the Wikipedia API. However, the pages are represented only by their page_id. How can I get a page's actual textual content by its page_id using the Wikipedia API?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
HHH
  • 6,085
  • 20
  • 92
  • 164

2 Answers

4

AFAIK there is no direct way of getting the text of a wiki page from the pageid. However, there are a couple of workarounds.

Get the URL and then parse: get the URL of the wiki page by making an API call like

http://en.wikipedia.org/w/api.php?action=query&prop=info&pageids=<your_pageid_here>&inprop=url

then go to that URL and parse the text.
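The first step above can be sketched in Python using only the standard library. The endpoint and parameters come from the answer; the helper names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def info_query_url(pageid):
    """Build the query URL that asks for a page's info, including its URL."""
    params = {"action": "query", "prop": "info",
              "pageids": pageid, "inprop": "url", "format": "json"}
    return API + "?" + urlencode(params)

def url_for_pageid(pageid):
    """Resolve a numeric pageid to the page's canonical URL (network call)."""
    with urlopen(info_query_url(pageid)) as resp:
        data = json.load(resp)
    # The response keys pages by the pageid as a string.
    return data["query"]["pages"][str(pageid)]["fullurl"]
```

Once you have the `fullurl`, fetching and parsing the page itself is a separate step with whatever HTML parser you prefer.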

Get pagename and then the content

The Wikipedia API allows extraction of text if the pagename is known. But since you only know the pageid for now, you will need to convert the pageid into a pagename with an API call like

http://en.wikipedia.org/w/api.php?action=query&pageids=<your_pageid_here>&format=json

This will give you the pagename; then you can make another API call to get the contents:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=<your_pagename_here>&format=json
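The two calls above can be chained like this (a stdlib-only Python sketch; the helper names are mine, the endpoints and parameters are those given in the answer):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def pagename_query_url(pageid):
    """First call: resolve the pageid to a page title."""
    return API + "?" + urlencode({"action": "query",
                                  "pageids": pageid, "format": "json"})

def parse_query_url(pagename):
    """Second call: fetch the parsed page content by title."""
    return API + "?" + urlencode({"action": "parse", "prop": "text",
                                  "page": pagename, "format": "json"})

def content_for_pageid(pageid):
    """Chain both calls: pageid -> title -> parsed HTML (network calls)."""
    with urlopen(pagename_query_url(pageid)) as resp:
        title = json.load(resp)["query"]["pages"][str(pageid)]["title"]
    with urlopen(parse_query_url(title)) as resp:
        return json.load(resp)["parse"]["text"]["*"]
```

Note that `action=parse` returns rendered HTML, not plain text, which is what the comment below is about.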

Shreyas Chavan
  • 1,079
  • 1
  • 7
  • 17
  • It looks like if I use the content from the API you mentioned, it returns all the HTML tags as well. Is there any way to use the export feature to get only text? https://en.wikipedia.org/wiki/Special:Export – HHH Jul 16 '15 at 20:51
  • According to en.wikipedia.org/wiki/Special:Export it seems that it is meant for MediaWiki migrations and exports in the form of XML. There's no way to get raw text from that, but you can convert HTML to plain text pretty easily with an HTML parser. I personally like to use [Jsoup](http://jsoup.org/) – Shreyas Chavan Jul 16 '15 at 21:02
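As the comment suggests, stripping the tags from the returned HTML is straightforward with any HTML parser. A minimal sketch using Python's standard-library `html.parser` (Jsoup would be the Java equivalent):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only character data, discarding every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

def strip_tags(html):
    """Return the plain text of an HTML fragment, tags removed."""
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()

# strip_tags("<p>Hello <b>world</b></p>") -> "Hello world"
```

This is a rough extraction; a real HTML parser library gives you finer control (e.g. skipping tables or reference markup in the article body).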
1

You can do that by adding a hyperlink; all you need is the pageid you got from the API: href=http://en.wikipedia.org/?curid=${pageid}. So the final link looks like https://en.wikipedia.org/?curid=13673345
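Building that link is a one-liner; Wikipedia redirects the `curid` URL to the actual article. A small sketch (the helper name is mine):

```python
def curid_link(pageid):
    """Build a link that Wikipedia redirects to the article for this pageid."""
    return "https://en.wikipedia.org/?curid={}".format(pageid)

# curid_link(13673345) -> "https://en.wikipedia.org/?curid=13673345"
```

Note this gives you a browsable URL, not the page text itself; you would still fetch and parse it as in the other answer.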

Samiul Karim
  • 37
  • 1
  • 5