
Using the advice from this SO question on how to get the content of random Wikipedia pages, I notice that the API sometimes returns pages without the page body.

This program demonstrates the difference in output that the API can yield.

import json
import requests

r = requests.get('https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&grnnamespace=0&prop=revisions&rvprop=content&grnlimit=75')
j = r.json()
# print(json.dumps(j, indent=4))

for page in j['query']['pages'].values():
    # 'revisions' is the field that carries the page body
    content = "has content" if 'revisions' in page else "no content"

    print(content, "\t", page['title'], "\thttp://en.wikipedia.org/?curid=" + str(page['pageid']))

    # print the full json if there's no content
    if 'revisions' not in page:
        print(page)

    print()

output snippet:

has content      Hunjan         http://en.wikipedia.org/?curid=8233868

no content       Alope (Thessaly)       http://en.wikipedia.org/?curid=58260510
{'pageid': 58260510, 'ns': 0, 'title': 'Alope (Thessaly)'}

For pages that show "has content", the j['query']['pages'][k]['revisions'] field is populated, which means the full page body is being returned by the API. For pages that show "no content", that field is absent, and the entire (short) json is dumped. There's no readily apparent reason this should be the case.

Does anyone know why some articles return their body and some don't? Thanks!

1 Answer

Honestly, I don't fully comprehend the MediaWiki API design myself, but I can give you a few hints that I believe will help.

First, reduce `grnlimit` to 50 and the issue should disappear.

API:Query: pageids: "Maximum number of values is 50 (500 for clients allowed higher limits)."
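
For instance, here is the same query from the question with the limit at the 50-item cap; a minimal sketch that passes the parameters through requests' params dict rather than a hand-built URL, so the limit is easy to adjust:

import requests

# Same query as in the question, but with grnlimit at the
# 50-item cap for unprivileged clients.
params = {
    'format': 'json',
    'action': 'query',
    'generator': 'random',
    'grnnamespace': 0,
    'prop': 'revisions',
    'rvprop': 'content',
    'grnlimit': 50,
}
r = requests.get('https://en.wikipedia.org/w/api.php', params=params)
j = r.json()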

What I believe is happening here is automatic pagination by the API. I bet you are not getting the `batchcomplete` message from the API.

The API returns a batchcomplete element to indicate that all data for the current batch of items has been returned.

(https://www.mediawiki.org/wiki/API:Query#Response_6)
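
You can check this against the response from the question's code; this snippet (assuming j is the parsed response from above) reports whether the batch completed:

# With grnlimit=75 the top-level 'batchcomplete' key should be
# absent and a 'continue' block present instead, confirming that
# the API paginated the batch.
if 'batchcomplete' in j:
    print("batch complete: all revisions were returned")
else:
    print("batch incomplete; continue block:", j.get('continue'))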

What you need to do, if you don't want to change the number of results, is to continue your query. See https://www.mediawiki.org/wiki/API:Query#Example_4:_Continuing_queries for how. You'll then see that the pages without revisions reappear in the following continuations with their full revisions included.
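
A minimal continuation loop, following the pattern from that documentation page, might look like the sketch below; it merges each response's pages and stops at the first batchcomplete, since (as noted in the comments) the random generator would otherwise keep yielding pages forever:

import requests

params = {
    'format': 'json',
    'action': 'query',
    'generator': 'random',
    'grnnamespace': 0,
    'prop': 'revisions',
    'rvprop': 'content',
    'grnlimit': 75,
}

pages = {}
while True:
    r = requests.get('https://en.wikipedia.org/w/api.php', params=params)
    j = r.json()

    # Merge this batch's pages; pages seen in an earlier response
    # get their 'revisions' filled in by later ones.
    for pageid, page in j.get('query', {}).get('pages', {}).items():
        pages.setdefault(pageid, {}).update(page)

    # 'batchcomplete' marks the end of the current batch; stop here
    # rather than walking the endless stream of random pages.
    if 'batchcomplete' in j:
        break

    # Feed the continue parameters back into the next request.
    params.update(j['continue'])

print(len(pages), "pages,",
      sum('revisions' in p for p in pages.values()), "with revisions")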

I don't know why they have designed it this way, but that's the way it is.

AXO
  • Thanks! Do you know why this query returns a continue block? If I try to pass the contents of the continue block back into my query, as example 4 in the documentation suggests, I get an endless cycle of continue blocks no matter how many real entries there are. [link](https://en.wikipedia.org/w/api.php?format=json&action=query&generator=random&prop=revisions&rvprop=content&rvslots=*&grnlimit=2) – Lava Salesman Nov 23 '22 at 19:50
  • @LavaSalesman The continue block means the result is too large and will be returned over multiple requests. You should at least continue until you get the first `batchcomplete` message; then you can combine the responses you've received to get a complete result set. And yes, the continues will be almost endless: the random generator walks through all Wikipedia pages and will keep going until all of them have been returned. The `grnlimit` only specifies how many pages are passed to the revisions module in each batch. – AXO Nov 23 '22 at 22:08