
I want to get the contents of a Wikipedia page and then do some funny stuff with it.

I want the contents in XML or JSON format, but so far I haven't found a way to do it.

This is as far as I have gotten:

https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=revisions&titles=April_1&rvprop=content&rvcontentformat=text%2Fx-wiki

But the content comes back as XWiki markup (text/x-wiki), and I cannot switch it to JSON because the page does not support that.

How can I parse the XWiki markup into JSON, or how else can I get the contents of the page?

Thanks!

  • How would you convert the XWiki format to JSON? What would you expect the output to look like if it could really be represented in JSON? – f1sh Apr 01 '16 at 14:48
  • So if we take the April 1 page as an example, I would like to see it as a tree whose first level of children is Events, Births, Deaths, Holidays and observances, and External links. Their children would be a year with its event afterwards, or just year+event. – Petru Daniel Tudosiu Apr 01 '16 at 14:53
  • That's not how Wikipedia is structured. Each page is simply text. Having a structure inside it is the result of the XWiki markup. If you want to transform that into structured JSON, you will have to write a converter. – f1sh Apr 01 '16 at 15:03
  • OK, thanks! I found half of the solution in an HTML format :-? Maybe I can work from there. https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=revisions&titles=April_1&rvprop=content&rvcontentformat=text%2Fx-wiki – Petru Daniel Tudosiu Apr 01 '16 at 15:04

1 Answer


Yes, you can use the HTML parser in XWiki Rendering to parse the HTML generated by Wikipedia. This gives you an AST that you can process however you wish.

See http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome for more details.
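To illustrate the general idea of turning page HTML into a tree, here is a minimal sketch. It does not use XWiki Rendering; it swaps in Python's standard `html.parser` module as a stand-in. The class name `SectionTreeBuilder` and the function `html_to_tree` are made up for this example, and it assumes each section is introduced by an `<h2>` and its entries are `<li>` items, which may not hold for every page. If the HTML still contains "[edit]" links inside the headings, they would end up in the heading text.

```python
from html.parser import HTMLParser


class SectionTreeBuilder(HTMLParser):
    """Hypothetical helper: groups <li> texts under the latest <h2> heading."""

    def __init__(self):
        super().__init__()
        self.tree = {}          # heading text -> list of list-item texts
        self._heading = None    # name of the section currently being filled
        self._in_h2 = False     # True while inside an <h2> element
        self._li_depth = 0      # nesting depth of <li> elements
        self._buf = []          # text fragments of the current heading/item

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buf = []
        elif tag == "li":
            # Only a top-level <li> starts a new entry; nested items are
            # flattened into the text of their parent.
            if self._li_depth == 0:
                self._buf = []
            self._li_depth += 1

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self._in_h2 = False
            self._heading = "".join(self._buf).strip()
            self.tree[self._heading] = []
        elif tag == "li" and self._li_depth > 0:
            self._li_depth -= 1
            # Items that appear before the first heading are ignored.
            if self._li_depth == 0 and self._heading is not None:
                text = " ".join("".join(self._buf).split())
                self.tree[self._heading].append(text)

    def handle_data(self, data):
        if self._in_h2 or self._li_depth > 0:
            self._buf.append(data)


def html_to_tree(html):
    """Return a dict mapping each <h2> heading to its list-item texts."""
    builder = SectionTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.tree
```

The resulting dict can be passed to `json.dumps` to get the Events/Births/Deaths style tree described in the comments. Splitting each entry into year and event is left out, since the separator used on the page is not something this sketch assumes.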

You just need to find a way to get the Wikipedia content in HTML.
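One possible way to get the HTML is the MediaWiki API's `action=parse` module, which returns the rendered page instead of the raw markup. The sketch below is only an illustration: the helper names `build_parse_url`, `extract_html`, and `fetch_page_html` and the `User-Agent` string are invented for this example. It assumes `formatversion=2`, where the rendered HTML is returned as a plain string under `parse.text`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title):
    """Build an action=parse request URL for the given page title."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",          # ask only for the rendered HTML
        "format": "json",
        "formatversion": "2",    # assumption: 'text' is then a plain string
    }
    return API_URL + "?" + urlencode(params)


def extract_html(response):
    """Pull the rendered HTML out of a decoded action=parse response."""
    return response["parse"]["text"]


def fetch_page_html(title):
    """Download and return the rendered HTML of a page (needs network access)."""
    request = Request(
        build_parse_url(title),
        headers={"User-Agent": "wiki-html-example/0.1"},  # placeholder value
    )
    with urlopen(request) as reply:
        return extract_html(json.load(reply))
```

Calling `fetch_page_html("April_1")` should then give an HTML string that can be handed to whatever HTML parser you choose.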