-2

I am writing an "auto-wikifier" tool using HTML and JavaScript. For each word in the text to be wikified, I need to obtain a list of pages that contain that word (so that the matching phrases in the text can be automatically wikified, if they are found). Is there a way to obtain a list of all Wikipedia pages that contain a specific word, using one of Wikipedia's APIs or web services?

function getMatchingPageTitles(theString){
    //get a list of all matching page titles for a specific string, using one of Wikipedia's APIs or web services
}
Hugolpz
Anderson Green

2 Answers

6

First, I'm not sure I understand how something like that would be useful. (Wikipedia has articles for all the common words, and I don't think links to them would be of any use.)

But if you really wanted to do something like this, I think a much better way would be to use the API to find out which words from your input text have articles.

For example, for the string I am writing an "auto-wikifier" tool, your query could look something like:

http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=I|am|writing|an|auto-wikifier|tool

And the answer is:

<api>
  <query>
    <normalized>
      <n from="am" to="Am" />
      <n from="writing" to="Writing" />
      <n from="an" to="An" />
      <n from="auto-wikifier" to="Auto-wikifier" />
      <n from="tool" to="Tool" />
    </normalized>
    <pages>
      <page ns="0" title="Auto-wikifier" missing="" />
      <page pageid="2513432" ns="0" title="Am" />
      <page pageid="2513422" ns="0" title="An" />
      <page pageid="25346998" ns="0" title="I" />
      <page pageid="30677" ns="0" title="Tool" />
      <page pageid="32977" ns="0" title="Writing" />
    </pages>
  </query>
</api>

A few notes:

  • The results are not in the order you specified them.
  • If a page doesn't exist, the result has a missing="" attribute.
  • JSON and JSONP formats are available too, which might be more suitable for JavaScript.
  • The titles parameter has a limit of 50 titles per query.
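
The notes above can be sketched in JavaScript. This is only a minimal sketch, assuming the same query with format=json instead of format=xml; in that format, pages arrive as an object keyed by page ID, and only existing pages carry a pageid field. The names `buildTitleQueryUrls` and `matchWordsToPages` are made up for illustration.

```javascript
// Assumption: same endpoint and parameters as the query URL above, but JSON output.
const API_ENDPOINT = "https://en.wikipedia.org/w/api.php";
const MAX_TITLES_PER_QUERY = 50; // the per-query limit mentioned in the notes

// Split the words into batches of at most 50 and build one query URL per batch.
function buildTitleQueryUrls(words) {
  const unique = [...new Set(words)];
  const urls = [];
  for (let i = 0; i < unique.length; i += MAX_TITLES_PER_QUERY) {
    const batch = unique.slice(i, i + MAX_TITLES_PER_QUERY);
    const params = new URLSearchParams({
      format: "json",
      action: "query",
      titles: batch.join("|"),
    });
    urls.push(`${API_ENDPOINT}?${params}`);
  }
  return urls;
}

// Map each input word to the title of its article, skipping words without one.
// Results are not returned in input order, so the lookup goes through the
// "normalized" list (for example "am" -> "Am") instead of relying on position.
function matchWordsToPages(words, response) {
  const query = response.query || {};
  const normalized = new Map(
    (query.normalized || []).map(n => [n.from, n.to])
  );
  const existing = new Set(
    Object.values(query.pages || {})
      .filter(page => "pageid" in page) // missing pages have no pageid
      .map(page => page.title)
  );
  const matches = new Map();
  for (const word of words) {
    const title = normalized.get(word) || word;
    if (existing.has(title)) matches.set(word, title);
  }
  return matches;
}
```

Each URL would then be requested with fetch or JSONP, and the parsed response passed to `matchWordsToPages` together with the words of that batch.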
svick
  • 3
    +1 for a solution which doesn't involve bombarding a non-profit site with a bunch of pointless traffic. – Tim M. Jan 22 '13 at 18:30
  • Would there be a way to find all pages with titles that contain a certain word (instead of being an exact match of that word)? – Anderson Green Jan 22 '13 at 23:13
  • 1
    You could try something like https://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srsearch=intitle:tool&srprop=&srlimit=max, but that would mean one query for each word. – svick Jan 22 '13 at 23:48
  • @svick You made a very good point: It would not be useful to wikify every single word that has an article on Wikipedia. Instead, it would be better to obtain a list of page titles that are contained in the input, and then ask the user whether to wikify each title (with the titles being wikified in descending order). This would avoid the inherent redundancy of fully automatic wikification, since it would allow the user to choose specific phrases to wikify. – Anderson Green Jan 23 '13 at 21:27
  • I noticed that there is a limit of 50 results for each query. Is there a way to get all the page titles and redirect pages that contain a specific word (instead of just 50 of them)? – Anderson Green Feb 16 '13 at 23:39
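
One way to go past a single batch of results is to follow the API's continuation data. The sketch below is hedged: it assumes the intitle: search query from the comment above with format=json, and it assumes the response carries a `continue` object (with fields such as `sroffset`) that is passed back unchanged on the next request. `fetchJson` is a hypothetical helper supplied by the caller that takes a URL and resolves to the parsed JSON.

```javascript
// Build the search URL for titles containing `word`; `continuation` holds the
// fields of the previous response's "continue" object, if there was one.
function buildTitleSearchUrl(word, continuation = {}) {
  const endpoint = "https://en.wikipedia.org/w/api.php";
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    list: "search",
    srsearch: `intitle:${word}`,
    srprop: "",
    srlimit: "max",
    ...continuation,
  });
  return `${endpoint}?${params}`;
}

// Request batch after batch until the response no longer offers a continuation.
// This still costs at least one request per word, as the comment above warns.
async function fetchAllTitleMatches(word, fetchJson) {
  const titles = [];
  let continuation = {};
  do {
    const response = await fetchJson(buildTitleSearchUrl(word, continuation));
    const results = (response.query && response.query.search) || [];
    titles.push(...results.map(result => result.title));
    continuation = response.continue;
  } while (continuation);
  return titles;
}
```

In a browser, `fetchJson` could be something like `url => fetch(url + "&origin=*").then(r => r.json())`; the origin=* parameter is there to allow anonymous cross-origin requests.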
1

API:Allpages is an interesting start. Sadly, it is limited to 500 results per query.

Hugolpz
  • Is it possible to search for whole-word matches only? (I noticed that [this query](http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=Kre&aplimit=500) shows all pages that contain `Kre`, not just `Kre` as a single word. Is it possible to show results for the whole word only?) – Anderson Green Feb 16 '13 at 23:51
  • I think your trouble actually is to add spaces around Kre, within a PHP query. – Hugolpz Feb 17 '13 at 00:02
  • Note: I have also only just started digging into the MediaWiki API. I guess it's a general policy to limit queries to 500 results, so I will head toward DBpedia and SPARQL queries. – Hugolpz Feb 17 '13 at 00:03
  • I added spaces around `Kre`, and I'm still getting results like `Kreamer` and `Kreacher`. It would be better if the results only contained `Kre` as an isolated word. http://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=%20Kre%20&aplimit=500 – Anderson Green Feb 17 '13 at 00:16
  • Isolated spaces seem to be discarded automatically; I have already noticed this behaviour in other cases. Search online or open an SO question about this space issue. – Hugolpz Feb 17 '13 at 00:20
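
Since padding the parameter with spaces does not help, one alternative is to filter the returned titles on the client. Note that apfrom only sets the title at which the alphabetical listing starts, which is why `Kreamer` and `Kreacher` show up after `Kre`. The sketch below is an assumption-laden illustration, not part of the API: `filterWholeWordTitles` is a made-up helper, and it treats a "whole word" as a match that is not directly preceded or followed by a letter or digit.

```javascript
// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Keep only the titles in which `word` appears as an isolated word.
// The lookarounds use Unicode letter/number classes so that non-ASCII
// titles are handled too; matching is case-insensitive.
function filterWholeWordTitles(titles, word) {
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(word)}(?![\\p{L}\\p{N}])`,
    "iu"
  );
  return titles.filter(title => pattern.test(title));
}

// Made-up sample titles, for illustration only.
const sampleTitles = ["Kre", "Kreamer", "Kreacher", "Lake Kre"];
console.log(filterWholeWordTitles(sampleTitles, "Kre"));
```

With the sample list, only `Kre` and `Lake Kre` pass the filter.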