
I want to save all the content of the "United States of America" article to a text file, without images. I am looking for a response in text format.

How can I do that? I constructed this URL: http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=united_states&prop=revisions&rvprop=content

But I am not getting what I want. Maybe I'm missing some basic things.

  1. How can I get the content of whatever string I give in the query? Please help me with the URL.

  2. I am trying to save this in a text file. Can I get the response in a text format other than XML or JSON?

  3. In the United States example, I want to get the first column (the cities) of the "Leading population centers" table. Is it possible to get that information directly, or should I use the parser?

Olli
The Learner

1 Answer


If you just need the text of the article, action=raw is much simpler than using the API:

http://en.wikipedia.org/wiki/United_States?action=raw&ctype=text/css

or

http://en.wikipedia.org/wiki/United_States?action=raw&ctype=text/css&templates=expand

(ctype=text/css is only important if you want to open it in the browser.)
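As a rough sketch of how the action=raw URLs above could be used from a script, the following Python builds the URL for a title and writes the returned wikitext to a file. The function names, the output path, and the User-Agent string are assumptions made for illustration; ctype is left out because it only matters when opening the URL in a browser.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE = "http://en.wikipedia.org/wiki/"


def build_raw_url(title, expand_templates=False):
    """Build an action=raw URL for a page title (hypothetical helper)."""
    params = {"action": "raw"}
    if expand_templates:
        params["templates"] = "expand"
    # Page titles in the URL use underscores instead of spaces.
    return BASE + quote(title.replace(" ", "_")) + "?" + urlencode(params)


def save_wikitext(title, path, expand_templates=False):
    """Fetch the raw wikitext and write it to a text file (needs network access)."""
    request = Request(
        build_raw_url(title, expand_templates),
        # Placeholder User-Agent; replace with something identifying your script.
        headers={"User-Agent": "raw-fetch-example/0.1"},
    )
    with urlopen(request) as response:
        text = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    save_wikitext("United States", "united_states.txt")
```

Note that the saved file contains wiki markup, not rendered plain text.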

It is not clear what you mean in point 3, but if you want to extract data from tables, your best bet is probably to get the rendered (HTML) content and use some sort of DOM parser (and keep half an eye on Wikidata, which will make things much simpler within a few months).
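To illustrate the parser approach, here is a minimal sketch that pulls the first cell of every row out of an HTML table using only Python's standard-library html.parser. The class name, the helper, and the sample table (with placeholder values, not real figures) are made up for the example; real article HTML has several tables and nested markup, so you would first need to isolate the table you care about, and nested tables are not handled here.

```python
from html.parser import HTMLParser


class FirstColumnParser(HTMLParser):
    """Collect the text of the first cell (td or th) in each table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # first-cell text, one entry per row
        self._cell_index = -1   # index of the current cell within its row
        self._in_first = False  # True while inside a row's first cell
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = -1
        elif tag in ("td", "th"):
            self._cell_index += 1
            if self._cell_index == 0:
                self._in_first = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_first:
            self._in_first = False
            self.rows.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_first:
            self._buffer.append(data)


def first_column(html):
    parser = FirstColumnParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample input; the numbers are placeholders.
SAMPLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td><a href="#">New York</a></td><td>1</td></tr>
  <tr><td>Los Angeles</td><td>2</td></tr>
</table>
"""

if __name__ == "__main__":
    print(first_column(SAMPLE))  # ['City', 'New York', 'Los Angeles']
```

The header cell is included in the result; drop the first element if you only want the data rows.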

Tgr