Using the advice from this SO question on how to get the content of random Wikipedia pages, I notice that the API sometimes returns pages without the page body.
The program below demonstrates the inconsistency in the output the API can yield.
import json
import requests

r = requests.get('https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&grnnamespace=0&prop=revisions&rvprop=content&grnlimit=75')
j = json.loads(r.text)
# print(json.dumps(j, indent=4))

for k in j['query']['pages']:
    # j['query']['pages'][k]['revisions'] is the field that holds the page body
    content = "has content" if 'revisions' in j['query']['pages'][k] else "no content"
    print(content, "\t", j['query']['pages'][k]['title'], "\thttp://en.wikipedia.org/?curid=" + str(j['query']['pages'][k]['pageid']))
    # print the full json entry if there's no content
    if 'revisions' not in j['query']['pages'][k]:
        print(j['query']['pages'][k])
        print()
Output snippet:
has content Hunjan http://en.wikipedia.org/?curid=8233868
no content Alope (Thessaly) http://en.wikipedia.org/?curid=58260510
{'pageid': 58260510, 'ns': 0, 'title': 'Alope (Thessaly)'}
For pages that show "has content", the j['query']['pages'][k]['revisions'] field is populated, which means the API is returning the full page body. For pages that show "no content", that field is absent, and the program prints the entire (short) JSON entry instead. There's no readily apparent reason why this should be the case.
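In case it helps with diagnosis, here is a minimal sketch (assuming the API's default formatversion=1 JSON layout, where the wikitext sits under the '*' key of the first revision) that prints the response's top-level keys and extracts the body from the entries that do have one. The MediaWiki API marks an incomplete batch with a top-level 'continue' element and a complete one with 'batchcomplete', so those keys may be worth checking:

import requests

url = ('https://en.wikipedia.org/w/api.php?format=json&action=query'
       '&generator=random&grnnamespace=0&prop=revisions&rvprop=content&grnlimit=75')
j = requests.get(url).json()

# 'batchcomplete' signals a complete batch; 'continue' signals that the
# server truncated the result and expects a follow-up request.
print(list(j.keys()))

for page in j['query']['pages'].values():
    if 'revisions' in page:
        # With rvprop=content and the default formatversion, the wikitext
        # is stored under the '*' key of the first revision.
        wikitext = page['revisions'][0]['*']
        print(page['title'], '->', len(wikitext), 'characters of wikitext')

On my runs the keys printed for the sketch vary between responses, which is why I suspect it's related to the missing bodies, but I haven't confirmed that.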
Does anyone know why some articles return their body and some don't? Thanks!