Find common sequences of words in sublists in Python

Question

I have a nested list of strings:

    [['Start', 'двигаться', 'другая', 'сторона', 'света', 'надолго', 'скоро'], 
     ['Start', 'двигаться', 'другая', 'сторона', 'света', 'чтобы', 'посмотреть'],
     ['Start', 'двигаться', 'новая', 'планета'],
     ['Start', 'двигаться', 'сторона', 'признание', 'суверенитет', 'израильский'],
     ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'на'],
     ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'оккупировать'],
     ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'Голанский'],
     ['Start', 'двигаться', 'сторона', 'признание', 'и']]

I need an algorithm that finds the first longest sequence of words shared by two or more sublists (in this case all sublists share the first two elements, 'Start', 'двигаться') and joins it into a string. It then moves on to the next elements and finds the next longest sequence shared by two or more sublists ('другая', 'сторона', 'света' and 'сторона', 'признание' in this case); if a sublist contains that common sequence, the next string is made from it. If there are no common elements, the remaining words are simply joined into a single string, and so on. If only one element is left in a sequence, it is added to the previous string. A single common element does not count as a sequence either. The resulting sequences can be of any length, and a split may start from the first element. Desired output:

    [['Start двигаться', 'другая сторона света', 'надолго скоро'], 
     ['Start двигаться', 'другая сторона света', 'чтобы посмотреть'],
     ['Start двигаться', 'новая планета'],
     ['Start двигаться', 'сторона признание', 'суверенитет израильский'],
     ['Start двигаться', 'сторона признание', 'высот на'],
     ['Start двигаться', 'сторона признание', 'высот оккупировать'],
     ['Start двигаться', 'сторона признание', 'высот Голанский'],
     ['Start двигаться', 'сторона признание и']]

I've checked other LCS topics but didn't find a solution.

Considering your desired output, I wouldn't say "for two or more," since that would group `'сторона', 'признание', 'высот'` rather than just `'сторона', 'признание'`. Please clarify if "two or more" refers to words or lists. It looks like you might want "the most lists with two or more common words" in each step. — גלעד ברקן, May 23 '18 at 10:09
It's also unclear if you'd like an optimal solution, for example, minimizing the total number of strings in the final result. What if there's a choice between grouping common words on the right or the left of the lists? Do you expect to necessarily group words on the left side, moving to the right? — גלעד ברקן, May 23 '18 at 10:12
Is the output required to be ordered or could the lists appear in any order (in that case, I might have a solution based on automata)? — L3viathan, May 23 '18 at 10:16
@גלעדברקן First of all, thanks for your reply! I tried to clarify the question; I hope it helps. Any solution that gives the desired output would be really great, as I currently have no idea where to start. — Alex Nikitin, May 23 '18 at 10:18
@L3viathan Thanks for your reply! The lists can appear in any order; only the order of elements inside the nested lists matters. — Alex Nikitin, May 23 '18 at 10:19
This looks like building a word-level trie and then combining the words along the parts of each path that have only one branch, except for the last part, which is combined with every path to a leaf. That would produce the desired result in the given case. — Dan D., May 23 '18 at 10:24
@DanD. Yes, really looks like TRIE. I'm rather new to programming and especially algorithms, so help would be really appreciated! — Alex Nikitin, May 23 '18 at 10:36
Why is "высот" combined with each of the three different words that follow it? This doesn't seem to fit the pattern, and the algorithm suggested by @DanD. above requires a special exception for this case. — Sven Marnach, May 27 '18 at 07:06
@SvenMarnach Thanks for your reply! I marked that 'for two or more words in the list', but I added some text to clarify this. — Alex Nikitin, May 27 '18 at 07:10
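
The word-level trie idea from the comments can be sketched roughly as follows. This is only one possible reading of the requirements, not a confirmed solution: it assumes that "common" means a shared prefix path in the trie, that a split happens after every branch point or leaf, and that a one-word piece is joined to the piece that follows it (or to the previous one if it is last). All function names here (`build_trie`, `split_at_branches`, `merge_singletons`, `group_common_sequences`) are made up for illustration.

```python
def build_trie(lists):
    """Build a word-level trie as nested dicts: word -> child trie."""
    root = {}
    for words in lists:
        node = root
        for word in words:
            node = node.setdefault(word, {})
    return root


def split_at_branches(words, root):
    """Cut one list after every trie node that is a branch point or a leaf."""
    segments, current, node = [], [], root
    for word in words:
        current.append(word)
        node = node[word]
        if len(node) != 1:  # several continuations (branch) or none (leaf)
            segments.append(current)
            current = []
    if current:  # the list ended on an unbranched path (prefix of another list)
        segments.append(current)
    return segments


def merge_singletons(segments):
    """Assumed rule: a one-word segment joins the next one, or the previous if last."""
    carry, merged = [], []
    for segment in segments:
        segment = carry + segment
        carry = []
        if len(segment) == 1:
            carry = segment  # hold it back and prepend it to the next segment
        else:
            merged.append(segment)
    if carry:  # a single word was left over at the very end
        if merged:
            merged[-1] = merged[-1] + carry
        else:
            merged.append(carry)
    return merged


def group_common_sequences(lists):
    """Split every list at trie branch points and join each piece into a string."""
    root = build_trie(lists)
    return [
        [" ".join(segment) for segment in merge_singletons(split_at_branches(words, root))]
        for words in lists
    ]


data = [
    ['Start', 'двигаться', 'другая', 'сторона', 'света', 'надолго', 'скоро'],
    ['Start', 'двигаться', 'другая', 'сторона', 'света', 'чтобы', 'посмотреть'],
    ['Start', 'двигаться', 'новая', 'планета'],
    ['Start', 'двигаться', 'сторона', 'признание', 'суверенитет', 'израильский'],
    ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'на'],
    ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'оккупировать'],
    ['Start', 'двигаться', 'сторона', 'признание', 'высот', 'Голанский'],
    ['Start', 'двигаться', 'сторона', 'признание', 'и'],
]

for row in group_common_sequences(data):
    print(row)
```

On the sample data from the question this should give the desired output, including the 'высот на' and 'сторона признание и' cases, because the singleton rule handles them. Because the trie only compares prefixes, common sequences that start at different positions in different sublists are not detected, and the case where one sublist is a strict prefix of another is not treated specially.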

0 Answers