13

I can get a Wikipedia article in XML or any other format. But for a given term, I first want to know whether the returned text is a full article or just a list of ambiguous terms like the one I entered.

So "SEO" is an ambiguous (or redirect) term, but how can I tell this from the results? "New York", by contrast, returns a complete article.

EDIT

My simple question is: I have 400 city names and I want the Wikipedia content for each of them using the API. I don't want pages that are not city articles but only contain a redirect or other ambiguous terms. I want to discard those.

AgA
  • Could you link to the actual pages you are talking about? Because as Michael pointed out, the article “SEO” is not a disambiguation page. – svick Mar 13 '12 at 13:14
  • http://en.wikipedia.org/w/api.php?action=parse&page=seo&prop=text|headitems . Also, page="new york" is what I don't want, but page="New York" gives the correct article and not a disambiguation page. – AgA Mar 13 '12 at 13:19
  • Looks like I've got it. I can use this format: http://en.wikipedia.org/w/index.php?action=render&title=kirandul . If the article contains the text EDIT within an h2, then it is the full article I'm looking for. – AgA Mar 13 '12 at 16:32
  • I don't think checking for that would work correctly. Don't try to think of hacks and check for the category itself. – svick Mar 13 '12 at 16:38
  • Yes, I've realized that's the right way, but how do I check whether an article is in this category? What would its URL be? – AgA Mar 13 '12 at 16:43

3 Answers

11

You can check with the "Disambiguation" ppprop:

http://en.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=disambiguation&redirects&format=xml&titles=BNI
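
As a rough sketch of how the query above could be used from code: the snippet below builds the same request with `format=json` and picks out the pages that carry the `disambiguation` page prop. The function names and the sample response (including its page IDs and the second title) are made up for illustration; the sample only mimics the shape I'd expect the API to return, so check it against a real response before relying on it.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_pageprops_url(titles):
    """Build the pageprops query for one or more titles (joined with '|')."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "redirects": 1,
        "format": "json",
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urlencode(params)


def disambiguation_titles(response):
    """Return the titles in a parsed JSON response that have the 'disambiguation' prop."""
    pages = response.get("query", {}).get("pages", {})
    return sorted(
        page["title"]
        for page in pages.values()
        if "disambiguation" in page.get("pageprops", {})
    )


# Hand-written sample (assumed shape, invented page IDs), not real API output.
sample_response = {
    "query": {
        "pages": {
            "1001": {"pageid": 1001, "ns": 0, "title": "BNI",
                     "pageprops": {"disambiguation": ""}},
            "1002": {"pageid": 1002, "ns": 0, "title": "Example city"},
        }
    }
}

flagged = disambiguation_titles(sample_response)
```

Fetching `build_pageprops_url([...])` with any HTTP client and passing the decoded JSON to `disambiguation_titles` would then tell you which of your titles to discard.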

user2976654
  • Note that [link-only answers](http://meta.stackoverflow.com/tags/link-only-answers/info) are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference. – kleopatra Nov 10 '13 at 18:10
5

All disambiguation pages are in the aptly named category All disambiguation pages, so you can just check for that category.

As an alternative, you could check for the presence of the Disambiguation template, or one of its variants and their redirects.
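
A minimal sketch of the category check, assuming the JSON output of `prop=categories` restricted with `clcategories` to the one category of interest. The helper names and the sample response (page IDs and the "Example city" title) are hypothetical; only the category name comes from the answer above.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
DISAMBIG_CATEGORY = "Category:All disambiguation pages"


def build_categories_url(titles):
    """Ask only whether each title is in the disambiguation category."""
    params = {
        "action": "query",
        "prop": "categories",
        "clcategories": DISAMBIG_CATEGORY,
        "redirects": 1,
        "format": "json",
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urlencode(params)


def split_by_category(response):
    """Return (articles, disambiguation_pages) from a parsed JSON response."""
    articles, disambig = [], []
    for page in response.get("query", {}).get("pages", {}).values():
        if "missing" in page:
            continue  # the title does not exist at all
        categories = {c["title"] for c in page.get("categories", [])}
        target = disambig if DISAMBIG_CATEGORY in categories else articles
        target.append(page["title"])
    return sorted(articles), sorted(disambig)


# Hand-written sample (assumed shape, invented page IDs), not real API output.
sample_categories = {
    "query": {
        "pages": {
            "2001": {"pageid": 2001, "ns": 0, "title": "10000",
                     "categories": [{"ns": 14, "title": DISAMBIG_CATEGORY}]},
            "2002": {"pageid": 2002, "ns": 0, "title": "Example city"},
            "-1": {"ns": 0, "title": "No such page", "missing": ""},
        }
    }
}

articles, disambig = split_by_category(sample_categories)
```

With 400 city names you would probably send the titles in batches rather than one request per city; the batch size the API accepts is something to look up, not something this sketch assumes.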

svick
  • Are categories returned in the API? And checking for the template, as I suggested, requires retrieving the page-body. – Michael Paulukonis Mar 13 '12 at 13:22
  • Yes, you can use [`prop=categories`](http://en.wikipedia.org/w/api.php?querymodules=categories). – svick Mar 13 '12 at 13:55
  • I must be doing something wrong: https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch=SEO&srwhat=text&srlimit=2&prop=categories doesn't return any categories – Michael Paulukonis Mar 13 '12 at 14:25
  • That's because `prop=categories` works on pages, not search results. To get the pages, you can use the search as a [generator](http://www.mediawiki.org/wiki/API:Query#Generators). The query would then be https://en.wikipedia.org/w/api.php?action=query&generator=search&format=json&gsrsearch=SEO&gsrwhat=text&gsrlimit=2&prop=categories&cllimit=max. – svick Mar 13 '12 at 14:30
  • @svick how can I check for disambiguation template ? Which words identify it. Also if the content contains "
    – AgA Mar 13 '12 at 16:26
  • I think there are both articles without a TOC and disambiguation pages with a TOC, so that's not a good check. It seems you're looking at the HTML version of the page. In that case, look at ` – svick Mar 13 '12 at 16:37
  • @AgA I check for the presence of a first paragraph that ends with "May refer to:" in my web app. – Luke Taylor Mar 25 '16 at 14:10
1

Update: Disambiguation pages are a content type of Wikipedia (the installation), not a page type in MediaWiki (the software). Thus, the MediaWiki API has no knowledge of what disambiguation pages are, and has no method for retrieving them.

See this related discussion.

Other than the often-but-not-always method I lay out below, you would basically have to retrieve the page body and check for the presence of a disambiguation marker.


The below sometimes works:

When I search for SEO I get: https://en.wikipedia.org/wiki/SEO

Are you referring to disambiguation pages, like https://en.wikipedia.org/wiki/SEO_%28disambiguation%29 ?

If so, check whether the title contains "(disambiguation)".

For instance, the following search: https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch=SEO&srwhat=text&srlimit=2

yields

{
    "query": {
        "searchinfo": {
            "totalhits": 3507
        },
        "search": [
            {
                "ns": 0,
                "title": "Search engine optimization",
                "snippet": "Search engine optimization (<span class='searchmatch'>SEO<\/span>) is the process of improving the visibility of a website  or a web page  in search engine s via the \" <b>...<\/b> ",
                "size": 40468,
                "wordcount": 5269,
                "timestamp": "2012-03-11T11:43:26Z"
            },
            {
                "ns": 0,
                "title": "SEO (disambiguation)",
                "snippet": "<span class='searchmatch'>SEO<\/span>  or search engine optimization, the process of improving ranking in search engine results.  <span class='searchmatch'>SEO<\/span> may also refer to:  <span class='searchmatch'>Seo<\/span> (surname), a  <b>...<\/b> ",
                "size": 955,
                "wordcount": 103,
                "timestamp": "2012-02-22T12:51:20Z"
            }
        ]
    },
    "query-continue": {
        "search": {
            "sroffset": 2
        }
    }
}

You can play around with this @ the Wikipedia API Sandbox.
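
For what it's worth, the title check on search results like the ones above could be sketched as follows. The function name is made up, and the sample is trimmed from the output shown earlier. As noted, this only catches pages whose title actually ends in "(disambiguation)", so treat it as a first-pass filter at best.

```python
DISAMBIG_SUFFIX = "(disambiguation)"


def drop_titled_disambiguations(response):
    """Keep search hits whose title does not end in '(disambiguation)'."""
    hits = response.get("query", {}).get("search", [])
    return [
        hit["title"]
        for hit in hits
        if not hit["title"].strip().endswith(DISAMBIG_SUFFIX)
    ]


# Trimmed copy of the search output shown above (snippets omitted).
search_response = {
    "query": {
        "search": [
            {"ns": 0, "title": "Search engine optimization", "size": 40468},
            {"ns": 0, "title": "SEO (disambiguation)", "size": 955},
        ]
    }
}

kept = drop_titled_disambiguations(search_response)
```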

Michael Paulukonis
  • That doesn't always work. For example the page [10000](http://en.wikipedia.org/wiki/10000) is disambiguation, but its title doesn't contain that word. And there are many others. – svick Mar 13 '12 at 13:08
  • Crap, that's right: https://en.wikipedia.org/wiki/Wikipedia:Disambiguation#Naming_the_disambiguation_page – Michael Paulukonis Mar 13 '12 at 13:14
  • https://en.wikipedia.org/wiki/Seo this is the page you get for SEO. Or try this API link : http://en.wikipedia.org/w/api.php?action=parse&page=seo&prop=text|headitems – AgA Mar 13 '12 at 13:17
  • @AgA - that's the page you get searching for `Seo`, not `SEO` – Michael Paulukonis Mar 13 '12 at 13:21
  • My simple question is, I've 400 city names and I want the wikipedia content of it and I don't want those pages which are not city articles but only contain some redirection or other ambiguous terms. – AgA Mar 13 '12 at 13:25
  • @Michael Paulukonis if the content contains "
    – AgA Mar 13 '12 at 16:23