
I have a text file with a list of famous names compiled from disparate sources that I would like to normalize, so that I can accurately collate them. For example, the list includes variations such as Lao-tse, Lao Tzu, and Lao Zi, but all of those ultimately represent Laozi. What are my options to normalize/canonicalize the names?

One thing I've noticed is that if you try putting those variations directly into a Wikipedia URL, they all ultimately redirect to the same page (Lao-tse, Lao Tzu, Lao Zi). Is there perhaps a Wikidata API to query for those redirects or is there a simple way to capture the canonical term from the redirect behavior?
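To make the redirect idea concrete, here is a rough sketch of the sort of thing I'm imagining, using the standard query API with `redirects=1` so the server resolves the redirect itself. The helper names are mine, and I haven't verified this is the best approach:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def canonical_from_response(response):
    """Pull the canonical page title out of a parsed API response.

    With redirects=1 the API resolves the redirect server-side, so the
    single entry under query/pages carries the canonical title.
    """
    pages = response["query"]["pages"]
    return next(iter(pages.values()))["title"]

def canonical_title(name):
    params = urlencode({
        "action": "query",
        "titles": name,
        "redirects": 1,   # let the server follow the redirect for us
        "format": "json",
    })
    with urlopen(API + "?" + params) as f:
        return canonical_from_response(json.load(f))

# e.g. canonical_title("Lao Tzu") should come back as "Laozi"
```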

Matt V.

3 Answers


On the Laozi item https://www.wikidata.org/wiki/Q9333 there is a list of "also known as" entries; these are the skos:altLabel values and are called aliases. You can query for "Lao Tzu", for example, like this:

SELECT DISTINCT ?s ?sLabel ?sAltLabel WHERE {
  ?s wdt:P31 wd:Q5.
  OPTIONAL { ?s skos:altLabel ?sAltLabel. }
  FILTER(CONTAINS(?sAltLabel, "Lao Tzu"@en))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} LIMIT 100

When I ran it at https://query.wikidata.org/ I got a timeout exception, though, so I added ?s wdt:P106 wd:Q4964182. (occupation: philosopher) to narrow the search. (Maybe queries run from scripts don't get a timeout exception.)

Also, see https://www.wikidata.org/wiki/Help:Aliases, where it is written:

"Multiple items can have the same alias, so long as they have different descriptions."

so this should also be considered.
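If you want to run such a query from a script, a sketch along these lines should work against the public SPARQL endpoint. The function names are mine, and I've wrapped the alias in STR() so the CONTAINS comparison doesn't choke on labels in other languages:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

def alias_query(alias):
    """Build a SPARQL query for humans whose alias contains `alias`."""
    return """
    SELECT DISTINCT ?s ?sLabel WHERE {
      ?s wdt:P31 wd:Q5 ;
         wdt:P106 wd:Q4964182 ;   # occupation: philosopher, to avoid the timeout
         skos:altLabel ?alt .
      FILTER(CONTAINS(STR(?alt), "%s"))
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    } LIMIT 100""" % alias

def find_items(alias):
    params = urlencode({"query": alias_query(alias), "format": "json"})
    # WDQS asks clients to send a descriptive User-Agent
    req = Request(ENDPOINT + "?" + params,
                  headers={"User-Agent": "name-normalizer/0.1"})
    with urlopen(req) as f:
        data = json.load(f)
    return [b["s"]["value"] for b in data["results"]["bindings"]]

# find_items("Lao Tzu") should include http://www.wikidata.org/entity/Q9333
```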

dafnahaktana

You can use one of the MediaWiki APIs in this form:

https://en.wikipedia.org/w/api.php?action=query&titles=Laozi&prop=redirects

It has some obvious defects in actual use, some of which are mentioned in the page that's returned when you actually try it out.

I have also recently discovered that you can exercise the API from Python, using the mwclient library, in the following way.

from mwclient import Site

site = Site('en.wikipedia.org')
# Fetch the first ten redirects that point at the Laozi page
result = site.api('query', titles='Laozi', prop='redirects', rdlimit=10)
for page in result['query']['pages'].values():
    for item in page['redirects']:
        print(item['title'])

The first ten redirects are:

Lao tzu
Lao tse
Lao zi
Lao Tse
Li Er
Lao-Tzu
Lao Tze
Lao-tze
Lao-tzu
Lao Tsze

However, there seem to be quite a few more. Mind you, I could be wrong about this.
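To get past the first ten, my understanding is that you follow the API's standard continuation protocol: re-issue the query, feeding the continue block from each response back into the next request. A rough sketch with plain urllib rather than mwclient (the helper names are mine, and I haven't exhaustively tested this):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def titles_from(result):
    """Collect redirect titles from one parsed API response."""
    out = []
    for page in result['query']['pages'].values():
        out.extend(r['title'] for r in page.get('redirects', []))
    return out

def all_redirects(title):
    params = {'action': 'query', 'titles': title, 'prop': 'redirects',
              'rdlimit': 'max', 'format': 'json'}
    titles = []
    while True:
        with urlopen(API + '?' + urlencode(params)) as f:
            result = json.load(f)
        titles.extend(titles_from(result))
        if 'continue' not in result:       # no more batches
            break
        params.update(result['continue'])  # feed the cursor back in
    return titles

# all_redirects('Laozi') should return the full list of redirect titles
```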

Bill Bell

I ended up going with a good-enough, quick-and-dirty solution: using curl to fetch the Wikipedia page for each entry and pup to grab the resulting page title:

cat NAMES.txt | xargs -I % -d"\n" sh -c 'curl -s -G "https://en.wikipedia.org/w/index.php" --data-urlencode "title=%" | pup -p "#firstHeading text{}"' > NAMES.canonicalized.txt
Matt V.