I would like to make date comparisons between dates given by the Wikidata API.

At first I thought of using Python's datetime module, but I bumped into two problems:

  • Wikidata handles dates spanning billions of years into the past or the future, in both the Julian and Gregorian calendars, whereas datetime only works for Gregorian dates between years 1 and 9999.
  • When the precision is year (9) or coarser, the month and day are rendered as "00-00", which datetime.strptime can't parse.

For example in this sample query about Paris, this date can be converted to datetime:

datetime.strptime("+1968-01-01T00:00:00Z","+%Y-%m-%dT%H:%M:%SZ")
datetime.datetime(1968, 1, 1, 0, 0)

This one can't:

datetime.strptime("+2012-00-00T00:00:00Z","+%Y-%m-%dT%H:%M:%SZ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/_strptime.py", line 510, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
  File "/usr/lib/python3.5/_strptime.py", line 343, in _strptime
    (data_string, format))
ValueError: time data '+2012-00-00T00:00:00Z' does not match format '+%Y-%m-%dT%H:%M:%SZ'

Not to mention "-0300-00-00T00:00:00Z" (300 BCE)

I cannot simply compare years, because items about events that happened before the Common Era can have several dates within the same negative year.

I'm not too sure about the best way to deal with this. Is there another lib I can use?

Ash_Crow
  • What do you mean by date comparison? What would be the result of a comparison between "+2012-00-00T00:00:00Z" and "+2012-01-01T00:00:00Z"? – Xavier Combelle Jun 21 '17 at 21:49
  • The dates are qualifiers used on claims. Right now, I want the most recent population, but I'd like the method to be more generic. – Ash_Crow Jun 22 '17 at 05:00

1 Answer

tl;dr: datetime can't handle this kind of thing, so don't even try. You have strings; keep them and treat them as such.

You could simply sort them as strings, provided they have a consistent length (pad them otherwise) and format. This works for these "extended" ISO 8601:2004 timestamps, which are not strictly standard, since the standard does not allow 00 for months and days.
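As a minimal sketch of what plain string sorting does and does not give you, using made-up sample values in the format shown in the question (the variable names are only illustrative): same-sign, fixed-width timestamps come out in chronological order, but mixing signs does not, because "+" sorts before "-" and negative years end up in reverse.

```python
# Hypothetical sample values in the format the Wikidata API returns.
positive = [
    "+2012-00-00T00:00:00Z",
    "+1968-01-01T00:00:00Z",
    "+1977-03-25T00:00:00Z",
]

# Fixed-width strings with the same sign sort chronologically.
sorted_positive = sorted(positive)
print(sorted_positive)

# Mixing signs breaks the order: "+" (ASCII 43) sorts before "-" (ASCII 45),
# and "-0052" sorts before "-0300" although 300 BCE is the earlier date.
mixed = positive + ["-0300-00-00T00:00:00Z", "-0052-00-00T00:00:00Z"]
sorted_mixed = sorted(mixed)
print(sorted_mixed)
```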

Assuming Python 3, this code:

import json
import urllib.request

# Fetch the Paris (Q90) entity from the Wikidata API.
url = urllib.request.urlopen("https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q90&props=info%7Caliases%7Clabels%7Cdescriptions%7Cclaims%7Cdatatype%7Csitelinks%2Furls&languages=fr&languagefallback=1&formatversion=2")
data = json.loads(url.read().decode())
# Sort the "head of government" (P6) claims by their "start time" (P580) qualifier.
P6 = sorted(data['entities']['Q90']['claims']['P6'], key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'])
for x in P6:
  print(x['mainsnak']['datavalue']['value']['numeric-id'])

yields this result set:

1685301
947901
656015
2596877
3131449
1986521
1685102
1684642
601266
677730
289303
959708
2105
1685859
256294
2851133

Additionally, to handle negative (BCE) dates, you'll want to split your list in two:

  • items starting with a - sign
  • items starting with a + sign

Then sort the first list by month-day-time ascending, then by the magnitude of the year descending (300 BCE comes before 200 BCE). Since sort() and sorted() are guaranteed to be stable, the second pass keeps the order from the first pass within each year. Sort the second list plainly, and concatenate the two back together. This gives a proper ordering of signed ISO 8601 timestamps.

neg = [x for x in P6 if x['qualifiers']['P580'][0]['datavalue']['value']['time'].startswith('-') ]
pos = [x for x in P6 if x['qualifiers']['P580'][0]['datavalue']['value']['time'].startswith('+') ]
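# First pass: month, day and time, ascending.
neg.sort(key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'][5:])
# Second pass (stable): year magnitude, descending, so that -0300 comes before -0200.
neg.sort(key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'][1:5], reverse=True)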
pos.sort(key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'])
P6sorted = neg+pos
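The same ordering can also be sketched with a single key function instead of two lists and two passes. This is an alternative, not what the code above does: `time_key` is a hypothetical helper that turns the year into a signed integer (so no padding is needed) and keeps the rest of the timestamp as a fixed-width string.

```python
def time_key(time_string):
    """Return a tuple that sorts chronologically for signed timestamps.

    Assumes the format "+YYYY-MM-DDTHH:MM:SSZ" with a leading sign and a
    year of any length, as in the examples above.
    """
    sign = -1 if time_string.startswith("-") else 1
    # Split the unsigned part into the year and everything after it.
    year, rest = time_string[1:].split("-", 1)
    return (sign * int(year), rest)


# Hypothetical sample values.
samples = [
    "+1968-01-01T00:00:00Z",
    "-0052-00-00T00:00:00Z",
    "-0300-06-01T00:00:00Z",
    "-0300-02-01T00:00:00Z",
    "+0800-00-00T00:00:00Z",
]
ordered = sorted(samples, key=time_key)
for value in ordered:
    print(value)
```

With the claims from above, this would be used as `sorted(P6, key=lambda claim: time_key(claim['qualifiers']['P580'][0]['datavalue']['value']['time']))`.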

As for the padding, should it be needed, it's trivial enough using str.rjust(), although you'll have to alter the sorting somewhat to reflect the new length of the timestamps. str.zfill() is not the right tool for that job, as the string you're altering isn't numeric: it contains 'T', 'Z', '-' and ':'.

maxlength = max(len(claim['qualifiers']['P580'][0]['datavalue']['value']['time']) for claim in P6)
for claim in P6:
  value = claim['qualifiers']['P580'][0]['datavalue']['value']
  # Keep the sign and left-pad the rest with zeros to a common length.
  value['time'] = value['time'][0] + value['time'][1:].rjust(maxlength-1, "0")

neg = [x for x in P6 if x['qualifiers']['P580'][0]['datavalue']['value']['time'].startswith('-') ]
pos = [x for x in P6 if x['qualifiers']['P580'][0]['datavalue']['value']['time'].startswith('+') ]
neg.sort(key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'][maxlength-16:])
# The year is everything between the sign and the 16-character "-MM-DDTHH:MM:SSZ" suffix.
neg.sort(key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'][1:maxlength-16], reverse=True)
pos.sort(key=lambda claim: claim['qualifiers']['P580'][0]['datavalue']['value']['time'])
P6sorted = neg+pos
for claim in P6sorted:
  print([claim['mainsnak']['datavalue']['value']['id'],claim['qualifiers']['P580'][0]['datavalue']['value']['time']])

As an aside, you may want to "Decorate-Sort-Undecorate" (perform a Schwartzian transform), for readability.
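A minimal sketch of Decorate-Sort-Undecorate on this kind of data follows. The helper names and the sample claims are made up for illustration, and the key is the plain timestamp string, so it only covers same-sign, fixed-width dates.

```python
def start_time(claim):
    """Return the raw P580 (start time) string of a claim."""
    return claim["qualifiers"]["P580"][0]["datavalue"]["value"]["time"]


def sort_by_start_time(claims):
    # Decorate: pair each claim with its key. The index breaks ties, so
    # two dicts are never compared with each other (a TypeError in Python 3).
    decorated = [(start_time(claim), index, claim)
                 for index, claim in enumerate(claims)]
    # Sort: tuples compare element by element.
    decorated.sort()
    # Undecorate: drop the key and the index again.
    return [claim for _, _, claim in decorated]


def make_claim(time_string):
    """Build a hypothetical claim holding only the fields used here."""
    return {"qualifiers": {"P580": [{"datavalue": {"value": {"time": time_string}}}]}}


sample = [make_claim(t) for t in (
    "+2012-00-00T00:00:00Z",
    "+1968-01-01T00:00:00Z",
    "+1977-03-25T00:00:00Z",
)]
ordered_times = [start_time(claim) for claim in sort_by_start_time(sample)]
print(ordered_times)
```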

Finally, if you're worried about Julian vs Gregorian calendars, you'll have to convert the Julian dates into Gregorian dates by adding the corresponding number of days (the offset depends on the year, and which calendar was in use depends on the country), then apply the above method. The gap between the two calendars is 10 days in 1582 and 13 days today, so unless the dates you compare are that close together, it really shouldn't be too much of a worry.
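If the conversion is needed, one way to sketch it is to go through the Julian Day Number using the usual integer formulas. The function names are made up for illustration, and the sketch assumes a full date (no 00 month or day), integer years with a year 0, and years from -4712 onwards so that the day number stays non-negative.

```python
def julian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn):
    """Gregorian (year, month, day) for a Julian Day Number."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - 146097 * b // 4
    d = (4 * c + 3) // 1461
    e = c - 1461 * d // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return (year, month, day)


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to the equivalent Gregorian date."""
    return jdn_to_gregorian(julian_to_jdn(year, month, day))


# The day after Julian 4 October 1582 was Gregorian 15 October 1582.
print(julian_to_gregorian(1582, 10, 5))
print(julian_to_gregorian(1900, 1, 1))
```

The converted year, month and day can then be formatted back into the timestamp layout and sorted with the method above.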

Alphos