2

Here's an example data I have,

word_indices = [
 ('bus', 554, 1),
 ('bus', 719, 1),
 ('bus', 808, 1),
 ('accessibility', 572, 2),
 ('accessibility', 724, 2),
 ('accessibility', 809, 2),
 ('ada', 725, 3),
 ('ada', 810, 3),
 ('accessible', 695, 4),
 ('accessible', 707, 4),
 ('accessible', 726, 4),
 ('accessible', 811, 4),
 ('get', 10, 5),
 ('get', 17, 5),
 ('get', 98, 5),
 ('get', 179, 5),
 ('get', 733, 5),
 ('get', 812, 5),
 ('tickets', 734, 6),
 ('tickets', 813, 6),
 ('tickets', 907, 6),
 ('nov', 736, 7),
 ('nov', 815, 7),
 ('ticket', 816, 8),
 ('ticket', 818, 8),
 ('ticket', 828, 8),
 ('information', 817, 9),
 ('ticket', 816, 10),
 ('ticket', 818, 10),
 ('ticket', 828, 10),
 ('includes', 819, 11),
 ('includes', 834, 11),
 ('wine', 760, 12),
 ('wine', 820, 12),
 ('beer', 821, 13),
 ('supper', 822, 14),
 ('performance', 262, 15),
 ('performance', 278, 15),
 ('performance', 399, 15),
 ('performance', 823, 15),
 ('and', 97, 16),
 ('and', 178, 16),
 ('and', 261, 16),
 ('and', 366, 16),
 ('and', 370, 16),
 ('and', 397, 16),
 ('and', 501, 16),
 ('and', 581, 16),
 ('and', 636, 16),
 ('and', 677, 16),
 ('and', 711, 16),
 ('and', 824, 16),
 ('and', 833, 16),
 ('and', 852, 16),
 ('and', 871, 16),
 ('and', 928, 16),
 ('and', 1017, 16),
 ('and', 1026, 16),
 ('and', 1044, 16),
 ('and', 1088, 16),
 ('and', 1092, 16),
 ('and', 1111, 16),
 ('and', 1126, 16),
 ('and', 1150, 16),
 ('and', 1160, 16),
 ('and', 1166, 16),
 ('and', 1178, 16),
 ('and', 1181, 16),
 ('light', 502, 17),
 ('light', 825, 17),
 ('dessert', 826, 18),
 ('benefactor', 827, 19),
 ('ticket', 816, 20),
 ('ticket', 818, 20),
 ('ticket', 828, 20),
 ('adds', 829, 21),
 ('additional', 831, 22),
 ('contribution', 832, 23),
 ('and', 97, 24),
 ('and', 178, 24),
 ('and', 261, 24),
 ('and', 366, 24),
 ('and', 370, 24),
 ('and', 397, 24),
 ('and', 501, 24),
 ('and', 581, 24),
 ('and', 636, 24),
 ('and', 677, 24),
 ('and', 711, 24),
 ('and', 824, 24),
 ('and', 833, 24),
 ('and', 852, 24),
 ('and', 871, 24),
 ('and', 928, 24),
 ('and', 1017, 24),
 ('and', 1026, 24),
 ('and', 1044, 24),
 ('and', 1088, 24),
 ('and', 1092, 24),
 ('and', 1111, 24),
 ('and', 1126, 24),
 ('and', 1150, 24),
 ('and', 1160, 24),
 ('and', 1166, 24),
 ('and', 1178, 24),
 ('and', 1181, 24),
 ('includes', 819, 25),
 ('includes', 834, 25),
 ('special', 240, 26),
 ('special', 255, 26),
 ('special', 316, 26),
 ('special', 465, 26),
 ('special', 759, 26),
 ('special', 836, 26),
 ('special', 1027, 26),
 ('goodie', 837, 27),
 ('bag', 838, 28),
 ('upon', 839, 29),
 ('departure', 841, 30),
 ('health', 90, 31),
 ('health', 171, 31),
 ('health', 842, 31),
 ('health', 1000, 31),
 ('health', 1292, 31),
 ('health', 1313, 31),
 ('adhere', 845, 32),
 ('centers', 848, 33),
 ('disease', 850, 34),
 ('control', 851, 35),
 ('and', 97, 36),
 ('and', 178, 36),
 ('and', 261, 36),
 ('and', 366, 36),
 ('and', 370, 36),
 ('and', 397, 36),
 ('and', 501, 36),
 ('and', 581, 36),
 ('and', 636, 36),
 ('and', 677, 36),
 ('and', 711, 36),
 ('and', 824, 36),
 ('and', 833, 36),
 ('and', 852, 36),
 ('and', 871, 36),
 ('and', 928, 36),
 ('and', 1017, 36),
 ('and', 1026, 36),
 ('and', 1044, 36),
 ('and', 1088, 36),
 ('and', 1092, 36),
 ('and', 1111, 36),
 ('and', 1126, 36),
 ('and', 1150, 36),
 ('and', 1160, 36),
 ('and', 1166, 36),
 ('and', 1178, 36),
 ('and', 1181, 36),
 ('prevention', 853, 37),
 ('cdc', 268, 38),
 ('cdc', 854, 38),
 ('covid', 855, 39),
 ('guidelines', 271, 40),
 ('guidelines', 856, 40),
 ('living', 51, 41),
 ('living', 132, 41),
 ('living', 188, 41),
 ('living', 195, 41),
 ('living', 213, 41),
 ('living', 233, 41),
 ('living', 303, 41),
 ('living', 588, 41),
 ('living', 859, 41),
 ('living', 942, 41),
 ('living', 978, 41),
 ('living', 986, 41),
 ('living', 1227, 41),
 ('living', 1245, 41),
 ('room', 52, 42),
 ('room', 133, 42),
 ('room', 189, 42),
 ('room', 196, 42),
 ('room', 214, 42),
 ('room', 234, 42),
 ('room', 304, 42),
 ('room', 860, 42),
 ('room', 943, 42),
 ('room', 979, 42),
 ('room', 987, 42),
 ('room', 1228, 42),
 ('room', 1246, 42),
 ('tour', 40, 43),
 ('tour', 53, 43),
 ('tour', 121, 43),
 ('tour', 134, 43),
 ('tour', 190, 43),
 ('tour', 197, 43),
 ('tour', 215, 43),
 ('tour', 235, 43),
 ('tour', 305, 43),
 ('tour', 861, 43),
 ('tour', 944, 43),
 ('tour', 980, 43),
 ('tour', 988, 43),
 ('tour', 1201, 43),
 ('tour', 1220, 43),
 ('tour', 1229, 43),
 ('tour', 1247, 43)]

('bus', 719, 1) - i.e., word, word index and group number

I am trying to find best path (i.e., minimal distance) between each group word indices. (A path is a sequential group numbers)

Distance is sum of absolute differences between word group indices.

Example Output:

  • In group 1, we will have to select ('bus', 808, '1')
  • In group 2, we should get ('accessibility', 809, 2)
  • In group 3, we should get ('ada', 810, 3)
  • In group 4, We should get ('accessible', 811, 4)
  • In group 5, we should get ('get', 812, 5)
  • In group 6, we should get ('tickets', 813, 6) and so on....

Choose the path - (808, 809, 810, 811, 812, 813) rather than (719, 724, 725, 726, 733, 734) because the first path has minimal distance (i.e., absolute difference).

Trying to find an efficient & scalable approach but I can't figure out the logic.

from itertools import groupby

matches = []
first_group = None
for en, (key, group) in enumerate(groupby(word_indices, key=lambda x: (x[0], x[-1]))):
    current_group = list(group)
    if en < 1:
        first_group = current_group
        continue
    if first_group is not None and group:
        for rowx, ix, group_idx in first_group:
            for rowy, iy, group_idy in current_group:
                if ix - iy <= 10:
                    break

Please do help on this, if someone is familiar with an approach that will work. Much appreciated! Thank you

user_12
  • 1,778
  • 7
  • 31
  • 72
  • have you tried Dijkstra? – Marat Nov 28 '22 at 00:41
  • @Marat Sorry, I am not familiar with it. I haven't tried it. – user_12 Nov 28 '22 at 00:42
  • 4
    I did not downvote, but the question is really hard to read. I guess somebody gave up midway and let their frustration out this way – Marat Nov 28 '22 at 00:42
  • 2
    are start/end group fixed? What qualifies as a path (is it just sequential group numbers)? Can you elaborate on distance, is it a sum of absolute differences between word group indices? – Marat Nov 28 '22 at 00:43
  • @Marat Thank you for letting me know. I'll try to rephrase things. I just wanted to explain the question thoroughly. – user_12 Nov 28 '22 at 00:44
  • @Marat Yes a path is a sequential group numbers. Distance is the sum of absolute differences between word group indices (i.e., 808-809 etc.,). Start and end group may differ like some data can contain 100 groups from 0 or 30 groups etc., It starts with 0 to N. – user_12 Nov 28 '22 at 00:46
  • 2
    Should try to edit the post with more details. Not in comments. – Daniel Hao Nov 28 '22 at 00:53
  • @DanielHao Edited the question and added few things to clarify. – user_12 Nov 28 '22 at 00:57

1 Answers1

1
from collections import defaultdict
import heapq

def dijkstra(word_indices):
    groups = defaultdict(list)
    for word, group_index, group in word_indices:
        groups[group].append((word, group_index))
    start, stop = min(groups), max(groups)
    # queue contains distance, path, where path is a tuple of word_indices triplets
    # we start with all words from the first group and distance 0
    q = [(0, ((word, group_index, start),)) for word, group_index in groups[start]]
    while q:
        past_distance, path = heapq.heappop(q)
        _, last_group_index, last_group = path[-1]
        if last_group == stop:
            return past_distance, path
        next_group = last_group + 1
        for word, group_index in groups[next_group]:
            heapq.heappush(q, (past_distance + abs(last_group_index - group_index), path + ((word, group_index, next_group),)))
    

dijkstra(word_indices) returns a path of length 53:

(53,
 (('bus', 808, 1),
  ('accessibility', 809, 2),
  ('ada', 810, 3),
  ('accessible', 811, 4),
  ('get', 812, 5),
  ('tickets', 813, 6),
  ('nov', 815, 7),
  ('ticket', 816, 8),
  ('information', 817, 9),
  ('ticket', 818, 10),
  ('includes', 819, 11),
  ('wine', 820, 12),
  ('beer', 821, 13),
  ('supper', 822, 14),
  ('performance', 823, 15),
  ('and', 824, 16),
  ('light', 825, 17),
  ('dessert', 826, 18),
  ('benefactor', 827, 19),
  ('ticket', 828, 20),
  ('adds', 829, 21),
  ('additional', 831, 22),
  ('contribution', 832, 23),
  ('and', 833, 24),
  ('includes', 834, 25),
  ('special', 836, 26),
  ('goodie', 837, 27),
  ('bag', 838, 28),
  ('upon', 839, 29),
  ('departure', 841, 30),
  ('health', 842, 31),
  ('adhere', 845, 32),
  ('centers', 848, 33),
  ('disease', 850, 34),
  ('control', 851, 35),
  ('and', 852, 36),
  ('prevention', 853, 37),
  ('cdc', 854, 38),
  ('covid', 855, 39),
  ('guidelines', 856, 40),
  ('living', 859, 41),
  ('room', 860, 42),
  ('tour', 861, 43)))
Marat
  • 15,215
  • 2
  • 39
  • 48
  • Thank you so much. Its working for me. On larger corpus this is taking a lot of time, is there a way I can optimize this further? Please do let me know – user_12 Nov 28 '22 at 11:22
  • [Dijkstra](https://en.wikipedia.org/wiki/Dijkstra_algorithm) is the gold standard for shortest path algorithms. It would be extremely hard to improve upon that without exploiting known regularities in input data - and even in this case, it's more likely to be a publishable work in algorithms rather than a quick SO answer – Marat Nov 28 '22 at 15:22
  • If you are willing to dive into further optimization, I'd recommend looking into [branch and bound](https://en.wikipedia.org/wiki/Branch_and_bound) approach. It's really easy to implement ([example](https://stackoverflow.com/questions/74538180)), if you can come up with good bounds – Marat Nov 28 '22 at 15:30
  • Thank you for the references. I'll definitely check it out. – user_12 Nov 28 '22 at 16:58