4

My list:

l = ["volcano", "noway", "lease", "sequence", "erupt"]

Desired output:

'volcanowayleasequencerupt'

I have tried:

Using itertools.groupby, but it doesn't seem to work when the overlap is two letters long (e.g. leasesequence -> the "sese" stays):

>>> from itertools import groupby
>>> "".join([i[0] for i in groupby("".join(l))])
'volcanonowayleasesequencerupt'

As you can see, it only got rid of the last duplicated 'e'. It is also not ideal because any double letter inside a word gets shrunk to one, e.g. 'suddenly' becomes 'sudenly'.
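To make the failure concrete, here is a minimal sketch of what groupby is doing. The helper name `collapse_runs` is made up for illustration. groupby only groups runs of identical adjacent characters, so it can't see a repeated multi-letter chunk, and it also collapses legitimate double letters:

```python
from itertools import groupby


def collapse_runs(text):
    # groupby yields one (key, group) pair per run of equal adjacent characters,
    # so keeping only the keys shrinks every run to a single character.
    return "".join(key for key, _ in groupby(text))


print(collapse_runs("sequenceerupt"))  # 'sequencerupt'  (single-letter overlap removed)
print(collapse_runs("leasesequence"))  # 'leasesequence' (no adjacent equal letters, so "sese" stays)
print(collapse_runs("suddenly"))       # 'sudenly'       (a legitimate double letter is lost)
```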

I'm looking for the most Pythonic approach for this.

Thank you in advance.

EDIT

My list does not have any duplicated items in it.

hofD
  • What do you expect for `l = ['split', 'it', 'lit']`, where the third word matches further back than the second? – Kelly Bundy Jan 27 '20 at 15:09
  • @HeapOverflow I expect `'splitlit'` for this – hofD Jan 27 '20 at 15:09
  • So the rule is to crop each word just by its overlap with the word directly before it? – Kelly Bundy Jan 27 '20 at 15:10
  • @HeapOverflow Yes, like this. – hofD Jan 27 '20 at 15:12
  • Does this help? [https://stackoverflow.com/questions/28188296/finding-out-whether-there-exist-two-identical-substrings-one-next-to-another](https://stackoverflow.com/questions/28188296/finding-out-whether-there-exist-two-identical-substrings-one-next-to-another) – Clément Jan 27 '20 at 15:22

3 Answers

4

Using a helper function that crops a word t by removing its longest prefix that's also a suffix of s:

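def crop(s, t):
    # Try the longest prefix of t first; k == 0 (the empty prefix) always matches,
    # so the loop always returns and t comes back unchanged when nothing overlaps.
    for k in range(len(t), -1, -1):
        if s.endswith(t[:k]):
            return t[k:]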

And then crop each word with its preceding word:

>>> l = ["volcano", "noway", "lease", "sequence", "erupt"]
>>> ''.join(crop(s, t) for s, t in zip([''] + l, l))
'volcanowayleasequencerupt'

>>> l = ['split', 'it', 'lit']
>>> ''.join(crop(s, t) for s, t in zip([''] + l, l))
'splitlit'
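
In case the `zip([''] + l, l)` part looks cryptic, here is a small sketch that spells out the intermediate values. The names `words`, `pairs` and `pieces` are only for illustration. Each word is paired with the word directly before it, and the first word is paired with an empty string so it is kept whole:

```python
def crop(s, t):
    # Same helper as above: drop the longest prefix of t that is also a suffix of s.
    for k in range(len(t), -1, -1):
        if s.endswith(t[:k]):
            return t[k:]


words = ["volcano", "noway", "lease", "sequence", "erupt"]

# Pair every word with its predecessor; the first word gets "" as predecessor.
pairs = list(zip([""] + words, words))
print(pairs)
# [('', 'volcano'), ('volcano', 'noway'), ('noway', 'lease'),
#  ('lease', 'sequence'), ('sequence', 'erupt')]

# Crop each word against its predecessor only.
pieces = [crop(s, t) for s, t in pairs]
print(pieces)           # ['volcano', 'way', 'lease', 'quence', 'rupt']
print("".join(pieces))  # 'volcanowayleasequencerupt'
```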
Kelly Bundy
2

A more readable version, in my opinion:

from functools import reduce


def max_overlap(s1, s2):

    return next(
        i
        for i in reversed(range(len(s2) + 1))
        if s1.endswith(s2[:i])
    )


def overlap(strs):

    return reduce(
        lambda s1, s2:
            s1 + s2[max_overlap(s1, s2):],
        strs, '',
    )


overlap(l)
#> 'volcanowayleasequencerupt'

However, it also considers "accumulated" characters from previous words that overlapped:

overlap(['split', 'it', 'lit'])
#> 'split'
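
If you want the behaviour confirmed in the question's comments (each word is cropped only against the word directly before it), one possible variation is to carry the previous word along in the reduce state. This is just a sketch, and `overlap_adjacent` and `step` are hypothetical names:

```python
from functools import reduce


def max_overlap(s1, s2):
    # Length of the longest prefix of s2 that is also a suffix of s1.
    return next(
        i
        for i in reversed(range(len(s2) + 1))
        if s1.endswith(s2[:i])
    )


def overlap_adjacent(strs):
    # The state is (text built so far, previous word). Only the previous word
    # is checked for overlap, not the accumulated text.
    def step(state, word):
        text, prev = state
        return text + word[max_overlap(prev, word):], word

    return reduce(step, strs, ("", ""))[0]


print(overlap_adjacent(["split", "it", "lit"]))
#> 'splitlit'
print(overlap_adjacent(["volcano", "noway", "lease", "sequence", "erupt"]))
#> 'volcanowayleasequencerupt'
```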
EliadL
  • FWIW, this uses `__add__`, which is slow, and requires up to `n` reallocations and copies for *each* item in the list. That's `O(n^2)`. Slightly faster is `__iadd__`. And `str.join` is faster than either of them on CPython due to the preallocation guaranteeing `O(n)` performance (among other things). – Mateen Ulhaq Jan 29 '20 at 01:20
  • @MateenUlhaq Thanks for that insight. Indeed I was going for "most pythonic" rather than "most efficient", per OP's request: _"I'm looking for the most Pythonic approach for this."_ – EliadL Jan 29 '20 at 08:41
1

Here's a brute-force deduplicator:

def dedup(a, b):
    # Drop from the end of a the longest suffix that is also a prefix of b.
    for i in range(len(b), 0, -1):
        if a[-i:] == b[:i]:
            return a[:-i]
    return a

Then, simply zip through:

>>> from itertools import chain, islice
>>> xs = ["volcano", "noway", "lease", "sequence", "erupt"]
>>> xs = [dedup(*x) for x in zip(xs, chain(islice(xs, 1, None), [""]))]
>>> "".join(xs)
'volcanowayleasequencerupt'

Naturally, this works for a list xs of any length, including an empty one.
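
To show what the zip is doing, here is a short sketch with the intermediate values written out. The names `successors` and `pieces` are only for illustration. Note that dedup trims the end of each word against the word that follows it, rather than trimming the start of the next word, and the last word is paired with an empty string so it is left alone:

```python
from itertools import chain, islice


def dedup(a, b):
    # Same helper as above: drop the suffix of a that overlaps the start of b.
    for i in range(len(b), 0, -1):
        if a[-i:] == b[:i]:
            return a[:-i]
    return a


xs = ["volcano", "noway", "lease", "sequence", "erupt"]

# Each word's successor: the list shifted by one, padded with "" at the end.
successors = chain(islice(xs, 1, None), [""])

pieces = [dedup(a, b) for a, b in zip(xs, successors)]
print(pieces)           # ['volca', 'noway', 'lea', 'sequenc', 'erupt']
print("".join(pieces))  # 'volcanowayleasequencerupt'
```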

Mateen Ulhaq