
I'm doing a research project for the summer and I've got to get some data from Wikipedia, store it and then do some analysis on it. I'm using the Wikipedia API to gather the data and I've got that down pretty well.

My question is in regards to the list=alllinks option in the API doc. After reading the description, both there and in the API itself (it's down a bit on the page and I can't link directly to the section), I think I understand what it's supposed to return. However, when I ran a query it gave me back something I didn't expect.

Here's the query I ran:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&list=alllinks&alunique&allimit=40&format=xml

Which in essence says: Get the last revision of the Google page, include the id, timestamp, user, comment and content of each revision, and return it in XML format. The alllinks (I thought) should give me back a list of Wikipedia pages which point to the Google page (in this case the first 40 unique ones).
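
For reference, here's a minimal sketch of how I'm making the request (Python purely for illustration; my actual code is Ruby):

import urllib.parse
import urllib.request

# Build the same query I ran above. ("alunique" is a flag parameter:
# its presence alone turns it on, the value is ignored.)
params = urllib.parse.urlencode({
    "action": "query",
    "prop": "revisions",
    "titles": "google",
    "rvprop": "ids|timestamp|user|comment|content",
    "rvlimit": "1",
    "list": "alllinks",
    "alunique": "",
    "allimit": "40",
    "format": "xml",
})
url = "http://en.wikipedia.org/w/api.php?" + params

with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8")[:500])  # first chunk of the XML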

I'm not sure what the policy is on swears, but this is the result I got back exactly:

<?xml version="1.0"?>
<api>
    <query><normalized>
        <n from="google" to="Google" />
        </normalized>
        <pages>
            <page pageid="1092923" ns="0" title="Google">
                <revisions>
                    <rev revid="366826294" parentid="366673948" user="Citation bot" timestamp="2010-06-08T17:18:31Z" comment="Citations: [161]Tweaked: url. [[User:Mono|Mono]]" xml:space="preserve">
                        <!-- The page content; I've replaced this because it's not of interest -->
                    </rev>
                </revisions>
            </page>
        </pages>
        <alllinks>
                <!-- offensive content removed -->
        </alllinks>
    </query>
    <query-continue>
        <revisions rvstartid="366673948" />
        <alllinks alfrom="!2009" />
    </query-continue>
</api>

The <alllinks> part is just a load of random gobbledygook and offensive comments. Not nearly what I thought I'd get. I've done a fair bit of searching, but I can't seem to find a direct answer to my question.

  1. What should the list=alllinks option return?
  2. Why am I getting this crap in there?

  • 1) It sounds like you downloaded a page that was vandalized exactly at that moment. 2) I would love to be able to do some analysis on Wikipedia using R. What analysis tool are you using? – Tal Galili Jun 24 '10 at 20:24
  • I'm not using any :P I've written it all myself as part of my research internship. There doesn't seem to be any decent Ruby code out there for scraping Wikipedia. I'm at the stage of writing the analysis code now. – Chris Salij Jun 25 '10 at 07:26
  • Try http://rubygems.org/gems/mediawiki-gateway, and if it's not decent enough, let me know why ;) – lambshaanxy Nov 08 '10 at 23:12

1 Answer


You don't want a list; a list is something that iterates over the whole wiki, not over one page. list=alllinks simply enumerates all links that point to a given namespace, regardless of which page they appear on.
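To see that in isolation, try alllinks on its own (illustrative query, using the alfrom parameter that shows up in your query-continue block):

http://en.wikipedia.org/w/api.php?action=query&list=alllinks&alfrom=Google&allimit=40&format=xml

It walks every link target on the entire wiki in alphabetical order starting at "Google"; it has no connection to the Google page at all. Your combined query started that walk from the very beginning of the alphabet, which is why you got titles like the "!2009" in your query-continue, plus whatever vandalism happened to be linked at the time.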

You want a property associated with the Google page, so you need prop=links instead of the alllinks crap.

So your query becomes:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions|links&titles=google&rvprop=ids|timestamp|user|comment|content&rvlimit=1&format=xml
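
If it's useful, here's a minimal sketch of pulling the link titles out of that response (Python purely for illustration; it assumes the links come back as <pl> elements under <links>, and it ignores query-continue, so you only get the first batch):

import urllib.request
import xml.etree.ElementTree as ET

# The corrected query: revisions plus the links property of the Google page.
# ("|" is percent-encoded as %7C.)
url = ("http://en.wikipedia.org/w/api.php?action=query"
       "&prop=revisions%7Clinks&titles=google"
       "&rvprop=ids%7Ctimestamp%7Cuser%7Ccomment%7Ccontent"
       "&rvlimit=1&format=xml")

with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

# prop=links returns each link target as <pl ns="..." title="..."/>.
for pl in tree.iter("pl"):
    print(pl.get("title"))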

Bryan