
Consider this list:

dates = [
    ('2015-02-03', 'name1'),
    ('2015-02-04', 'nameg'),
    ('2015-02-04', 'name5'),
    ('2015-02-05', 'nameh'),
    ('1929-03-12', 'name4'),
    ('2023-07-01', 'name7'),
    ('2015-02-07', 'name0'),
    ('2015-02-08', 'nameh'),
    ('2015-02-15', 'namex'),
    ('2015-02-09', 'namew'),
    ('1980-12-23', 'name2'),
    ('2015-02-12', 'namen'),
    ('2015-02-13', 'named'),
]

How can I identify the dates that are out of sequence? I don't care if they repeat or skip; I just need the ones that are way out of line. I.e., I should get back:

('1929-03-12', 'name4'),
('2023-07-01', 'name7'),
('2015-02-15', 'namex'),
('1980-12-23', 'name2'),

Namex is less obvious, but it's not in the general order of the list.

My simplistic start (which I have deleted to simplify the question) is obviously woefully incomplete.


Update: Based on the comments, it seems an implementation of the Longest Increasing Subsequence (LIS) will get me started; a Python implementation can be found here:

Seems once I get the LIS, I can compare it to the original list and see where the gaps are... Fascinating. SO is the hive-mind of awesomeness.

Trees4theForest

2 Answers


Short answer, general solution

Using my answer to the "Longest increasing subsequence" question, this could be implemented simply as:

def out_of_sequence(seq):
  indices = set(longest_subsequence(seq, 'weak', key=lambda x: x[0], index=True))
  return [e for i, e in enumerate(seq) if i not in indices]
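The snippet above depends on longest_subsequence, which is defined in the linked answer and not reproduced here. As a rough, self-contained stand-in, the sketch below uses a simple O(n²) dynamic program. Note that lnds_indices and out_of_sequence_simple are hypothetical names made up for this sketch, not the functions from that answer, and only the non-decreasing ('weak') case is covered. The sample list is a shortened, made-up variant of the one in the question.

```python
from operator import itemgetter


def lnds_indices(seq, key=None):
    """Return the indices of one longest non-decreasing subsequence of seq.

    Hypothetical stand-in for the helper used above; O(n^2) dynamic program.
    """
    if key is None:
        key = lambda x: x
    keys = [key(x) for x in seq]
    n = len(keys)
    if n == 0:
        return []
    length = [1] * n   # length[i]: longest subsequence ending at position i
    prev = [None] * n  # prev[i]: previous position in that subsequence
    for i in range(n):
        for j in range(i):
            if keys[j] <= keys[i] and length[j] + 1 > length[i]:
                length[i] = length[j] + 1
                prev[i] = j
    # backtrack from the position where the longest subsequence ends
    i = max(range(n), key=length.__getitem__)
    result = []
    while i is not None:
        result.append(i)
        i = prev[i]
    return result[::-1]


def out_of_sequence_simple(seq, key=None):
    # everything that is not part of the longest non-decreasing subsequence
    indices = set(lnds_indices(seq, key))
    return [e for i, e in enumerate(seq) if i not in indices]


sample = [
    ('2015-02-03', 'name1'),
    ('2015-02-04', 'name5'),
    ('1929-03-12', 'name4'),
    ('2015-02-05', 'nameh'),
    ('2015-02-15', 'namex'),
    ('2015-02-09', 'namew'),
    ('2015-02-12', 'namen'),
]
print(out_of_sequence_simple(sample, key=itemgetter(0)))
# [('1929-03-12', 'name4'), ('2015-02-15', 'namex')]
```

The quadratic loop is easier to follow than the bisect-based version, at the cost of being slower on long lists.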

Longer answer, specific solution

Based on the question at Code Review and a question about non-decreasing sequences (since that's what you're after), here's a solution to your problem:

from bisect import bisect_right
from operator import itemgetter


def out_of_sequence(seq, key=None):
  if key is None: key = lambda x: x

  lastoflength = [0] # end position of subsequence with given length
  predecessor = [None] # penultimate element of l.i.s. ending at given position

  for i in range(1, len(seq)):
    # find length j of subsequence that seq[i] can extend
    j = bisect_right([key(seq[k]) for k in lastoflength], key(seq[i]))
    # update old subsequence or extend the longest
    try: lastoflength[j] = i
    except IndexError: lastoflength.append(i)
    # record element preceding seq[i] in the subsequence for backtracking
    predecessor.append(lastoflength[j-1] if j > 0 else None)

  indices = set()
  i = lastoflength[-1]
  while i is not None:
    indices.add(i)
    i = predecessor[i]

  return [e for i, e in enumerate(seq) if i not in indices]


print(*out_of_sequence(dates, itemgetter(0)), sep='\n')

Outputs:

('1929-03-12', 'name4')
('2023-07-01', 'name7')
('2015-02-15', 'namex')
('1980-12-23', 'name2')

The key parameter (inspired by the sorted builtin) specifies a function of one argument that is used to extract a comparison key from each list element. The default value is None, so the caller has a convenient way of saying "I want to compare the elements directly". If it is None, we use lambda x: x as an identity function, so the elements are not changed in any way before the comparison.
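The same convention can be seen with the sorted builtin itself. A small illustration (the pairs list is made up for this example):

```python
from operator import itemgetter

pairs = [('2015-02-04', 'nameg'), ('2015-02-04', 'name5'), ('1929-03-12', 'name4')]

# Without a key, whole tuples are compared, so a tie on the date
# falls through to comparing the names.
print(sorted(pairs))
# [('1929-03-12', 'name4'), ('2015-02-04', 'name5'), ('2015-02-04', 'nameg')]

# With key=itemgetter(0), only the date is compared; sorted() is stable,
# so the two 2015-02-04 entries keep their original relative order.
print(sorted(pairs, key=itemgetter(0)))
# [('1929-03-12', 'name4'), ('2015-02-04', 'nameg'), ('2015-02-04', 'name5')]
```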

In your case, you want to use the dates as keys for comparison, so we use itemgetter(0) as key. And itemgetter(1) would use the names as key, see:

>>> print(*map(itemgetter(1), dates))
name1 nameg name5 nameh name4 name7 name0 nameh namex namew name2 namen named

Using itemgetter(k) is equivalent to lambda x: x[k]:

>>> print(*map(lambda x: x[1], dates))
name1 nameg name5 nameh name4 name7 name0 nameh namex namew name2 namen named

Using it with map is equivalent to a generator expression:

>>> print(*(x[1] for x in dates))
name1 nameg name5 nameh name4 name7 name0 nameh namex namew name2 namen named

But if we used a similar list comprehension to pass only the dates to out_of_sequence, we would get back only the dates, without the names, which is not the expected result:

>>> print(*out_of_sequence([x[0] for x in dates]), sep='\n')
1929-03-12
2023-07-01
2015-02-15
1980-12-23

Likewise, if we compare the date-name pairs directly, we get wrong results (because the dates tie and 'nameg' compares greater than 'name5'):

>>> print(*out_of_sequence(dates), sep='\n')
('2015-02-04', 'nameg')
('1929-03-12', 'name4')
('2023-07-01', 'name7')
('2015-02-15', 'namex')
('1980-12-23', 'name2')

Because we want to return dates and names, and we want to order by dates only, we need to pass a function that extracts dates using the key parameter.

An alternative would be to get rid of key and just write:

j = bisect_right([seq[k][0] for k in lastoflength], seq[i][0])

But since this is Stack Overflow, maybe one day another person will come across this answer and will need some other key extraction, so I decided to post the more general solution here.
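As one example of a different key extraction: the ISO-style strings in the question happen to sort chronologically as plain text, but other formats do not. Assuming a hypothetical day/month/year format, the key would have to parse the string first. The sketch below only demonstrates such a key function with sorted; parsed_date and records are made-up names, and the same callable could be passed as key to out_of_sequence.

```python
from datetime import datetime

# Hypothetical records in day/month/year format (not from the question).
records = [('03/02/2015', 'name1'), ('12/03/1929', 'name4'), ('01/07/2023', 'name7')]


def parsed_date(record):
    """Key function: parse the first field as a day/month/year date."""
    return datetime.strptime(record[0], '%d/%m/%Y').date()


# Plain string comparison looks at the day of the month first,
# which is not chronological order.
print(sorted(records))
# [('01/07/2023', 'name7'), ('03/02/2015', 'name1'), ('12/03/1929', 'name4')]

# Comparing the parsed dates gives chronological order.
print(sorted(records, key=parsed_date))
# [('12/03/1929', 'name4'), ('03/02/2015', 'name1'), ('01/07/2023', 'name7')]
```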

arekolek
  • @Trees4theForest you can see all edits if you click the "edited X time ago" link. But in this case, I changed only some comments and variable names, nothing of real significance. – arekolek Jul 10 '16 at 12:27
  • been testing regularly, and it works -- awesome! For my understanding, what is the key=None arg (and subsequent lambda x: x) doing through this – Trees4theForest Jul 12 '16 at 00:34
  • I've added an explanation to my answer. – arekolek Jul 12 '16 at 15:32
  • Hi @arekolek - I can't thank you enough for the work, and the explanation. However, I did just find one limitation: If the first run of *out-of-sequence* items is longer than the first run of *in-sequence* items, then regardless of how many in-sequence items follow, the function returns that initial run of in-sequence items. IE 1,2,3,991,992,993,994,5,6,7,8,9,10,11... returns: 1,2,3 – Trees4theForest Jul 13 '16 at 10:35
  • That's interesting, but also weird because [it returns 991,992,993,994 for me](http://ideone.com/hS2Rvg). But if you are sure there's a problem, I think it's best to post a new question. – arekolek Jul 13 '16 at 11:43
  • Try the reverse: 991, 992, 993, 1, 2, 3, 4, 5, 994, 995, 996, 997, 998, 999 – Trees4theForest Jul 13 '16 at 14:45
  • Therefore when you say "out of sequence", the word *sequence* does not refer to the *longest nondecreasing subsequence*, since `1 2 3 4 5 994 995 996 997 998 999` is longer than `991 992 993 994 995 996 997 998 999`, so the answer `991 992 993` would satisfy you if that was the case. You need to update your question (or post a new one) and include a better definition of the *sequence* to which the elements shall not belong. I found two problems related to LIS, one is Range-Constrained LIS, and the other is Slope-Constrained LIS. See if the formulation of any of them meets your criteria. – arekolek Jul 13 '16 at 16:22
  • If you can't figure out a formal definition for the *sequence* you're after, showing more simple examples (like the one in your last comment), along with expected result and explanation for why you expect it, could possibly be enough for others to come up with a problem statement that works for you. For example in the sequence `991 1 2 3 4 5 6 7 8 9 992 993 994 995 996 997 998 999 1000` would you say that `991` or `1 2 3 4 5 6 7 8 9` are out of sequence and why? – arekolek Jul 13 '16 at 16:34
  • Thanks - ultimately the "sequence" is determined by the entire list: _the longest non-decreasing run of all items considered "en masse"_ EG for 1, 991, 992, 2, 993, 3 it's impossible to know... For 1, 991, 992, 2, 993, 3, 4 it's 1,2,3,4. – Trees4theForest Jul 13 '16 at 22:05
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/117260/discussion-between-arekolek-and-trees4theforest). – arekolek Jul 13 '16 at 22:11

This will establish a new anchor_date for you if the current date is greater than the last good date.

import arrow

out_of_order = []
anchor_date = arrow.get(dates[0][0])
for dt, name in dates:
  if arrow.get(dt) < anchor_date:
    out_of_order.append((dt, name))
  else:
    anchor_date = arrow.get(dt)
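
For readers without the third-party arrow package, the same anchor idea can be sketched with the standard library alone, assuming the dates are ISO YYYY-MM-DD strings as in the question. out_of_order_by_anchor is a hypothetical name for this sketch, and the sample list is a shortened, made-up variant of the question's data. Under this rule any later date becomes the new anchor, so a far-future entry such as 2023-07-01 is not reported itself, and every earlier date after it is flagged.

```python
from datetime import date


def out_of_order_by_anchor(pairs):
    """Flag every pair whose date is earlier than the latest date seen so far."""
    out_of_order = []
    anchor = None
    for dt, name in pairs:
        current = date.fromisoformat(dt)  # available in Python 3.7+
        if anchor is not None and current < anchor:
            out_of_order.append((dt, name))
        else:
            anchor = current
    return out_of_order


sample = [
    ('2015-02-03', 'name1'),
    ('1929-03-12', 'name4'),
    ('2015-02-04', 'name5'),
    ('2023-07-01', 'name7'),
    ('2015-02-07', 'name0'),
]
print(out_of_order_by_anchor(sample))
# [('1929-03-12', 'name4'), ('2015-02-07', 'name0')]
```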
Hans Nelsen