How can I join a list of strings and remove duplicated letters (keep them chained)

Question

My list:

l = ["volcano", "noway", "lease", "sequence", "erupt"]

Desired output:

'volcanowayleasequencerupt'

I have tried:

Using itertools.groupby, but it doesn't work when two letters repeat in a row (e.g. leasesequence -> sese stays):

>>> from itertools import groupby
>>> "".join([i[0] for i in groupby("".join(l))])
'volcanonowayleasesequencerupt'

As you can see, it only removed the duplicated 'e' between 'sequence' and 'erupt'. It is also not ideal because any doubled letter inside a word is shrunk to one, e.g. 'suddenly' becomes 'sudenly'.
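
To make the failure mode concrete, here is a minimal sketch (the helper name `collapse_runs` is made up for illustration). It shows that `groupby` shrinks every run of equal adjacent characters, wherever it occurs, and does nothing about overlaps longer than one character:

```python
from itertools import groupby


def collapse_runs(text):
    # groupby yields one (key, group) pair per run of equal characters,
    # so keeping only the keys shrinks every run to a single character.
    return "".join(key for key, _ in groupby(text))


print(collapse_runs("suddenly"))       # sudenly
print(collapse_runs("leasesequence"))  # leasesequence (no adjacent repeats, so unchanged)
```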

I'm looking for the most Pythonic approach for this.

Thank you in advance.

EDIT

My list does not have any duplicated items in it.

What do you expect for `l = ['split', 'it', 'lit']`, where the third word matches further back than the second? — Kelly Bundy, Jan 27 '20 at 15:09
So the rule is to crop each word just by its overlap with the word directly before it? — Kelly Bundy, Jan 27 '20 at 15:10
Could this help? [https://stackoverflow.com/questions/28188296/finding-out-whether-there-exist-two-identical-substrings-one-next-to-another](https://stackoverflow.com/questions/28188296/finding-out-whether-there-exist-two-identical-substrings-one-next-to-another) — Clément, Jan 27 '20 at 15:22

score 4 · Answer 1 · answered Jan 27 '20 at 15:22

Using a helper function that crops a word t by removing its longest prefix that's also a suffix of s:

def crop(s, t):
    for k in range(len(t), -1, -1):
        if s.endswith(t[:k]):
            return t[k:]

And then crop each word with its preceding word:

>>> l = ["volcano", "noway", "lease", "sequence", "erupt"]
>>> ''.join(crop(s, t) for s, t in zip([''] + l, l))
'volcanowayleasequencerupt'

>>> l = ['split', 'it', 'lit']
>>> ''.join(crop(s, t) for s, t in zip([''] + l, l))
'splitlit'
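
For reuse, the two steps above can be wrapped into one function. This is only a sketch: the name `join_overlapping` is hypothetical, and it assumes the same rule as above, namely that each word is cropped only against the word directly before it.

```python
def crop(s, t):
    # Try the longest prefix of t first. k == 0 always matches because
    # every string ends with the empty string, so the loop always returns.
    for k in range(len(t), -1, -1):
        if s.endswith(t[:k]):
            return t[k:]


def join_overlapping(words):
    words = list(words)
    # Pair each word with its predecessor ('' stands in for "no predecessor").
    return ''.join(crop(s, t) for s, t in zip([''] + words, words))


print(join_overlapping(["volcano", "noway", "lease", "sequence", "erupt"]))
# volcanowayleasequencerupt
print(join_overlapping(['split', 'it', 'lit']))
# splitlit
```

An empty list gives an empty string, and a single word is returned unchanged.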

EliadL · Accepted Answer · 2020-01-29T15:16:23.383

2

A more readable version, in my opinion:

from functools import reduce


def max_overlap(s1, s2):

    return next(
        i
        for i in reversed(range(len(s2) + 1))
        if s1.endswith(s2[:i])
    )


def overlap(strs):

    return reduce(
        lambda s1, s2:
            s1 + s2[max_overlap(s1, s2):],
        strs, '',
    )


overlap(l)
#> 'volcanowayleasequencerupt'

Note, however, that it matches each word against the whole accumulated result, not just the word directly before it, so overlaps with earlier words count too:

overlap(['split', 'it', 'lit'])
#> 'split'

edited Jan 29 '20 at 15:16

answered Jan 27 '20 at 15:45

EliadL


1

FWIW, this uses `__add__`, which is slow, and requires up to `n` reallocations and copies for *each* item in the list. That's `O(n^2)`. Slightly faster is `__iadd__`. And `str.join` is faster than either of them on CPython due to the preallocation guaranteeing `O(n)` performance (among other things). – Mateen Ulhaq Jan 29 '20 at 01:20
@MateenUlhaq Thanks for that insight. Indeed I was going for "most pythonic" rather than "most efficient", per OP's request: _"I'm looking for the most Pythonic approach for this."_ – EliadL Jan 29 '20 at 08:41
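
Following up on the comment about `str.join`: below is one possible sketch that keeps the "accumulated" matching of `overlap` but collects the pieces in a list and joins them once at the end. The name `overlap_joined` and the bounded `tail` buffer are my own assumptions, not part of the answer, and I have not benchmarked it. It also assumes `strs` is a list, since it is iterated twice.

```python
def overlap_joined(strs):
    parts = []
    # Only the last max_len characters of the result can overlap the next
    # word, so keep just that much instead of the whole growing string.
    max_len = max(map(len, strs), default=0)
    tail = ''
    for word in strs:
        # Longest prefix of word that is also a suffix of the result so far;
        # i == 0 always matches, so next() never raises StopIteration.
        k = next(i for i in range(len(word), -1, -1)
                 if tail.endswith(word[:i]))
        piece = word[k:]
        parts.append(piece)
        tail = (tail + piece)[-max_len:] if max_len else ''
    return ''.join(parts)


print(overlap_joined(["volcano", "noway", "lease", "sequence", "erupt"]))
# volcanowayleasequencerupt
print(overlap_joined(['split', 'it', 'lit']))
# split
```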

Mateen Ulhaq · Answer 3 · 2020-01-27T15:27:49.347

Here's a brute-force deduplicator:

def dedup(a, b):
    for i in range(len(b), 0, -1):
        if a[-i:] == b[:i]:
            return a[:-i]
    return a

Then, simply zip through:

>>> from itertools import chain, islice
>>> xs = ["volcano", "noway", "lease", "sequence", "erupt"]
>>> xs = [dedup(*x) for x in zip(xs, chain(islice(xs, 1, None), [""]))]
>>> "".join(xs)
'volcanowayleasequencerupt'

This works for a list xs of any length.
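
As a quick check of the edge cases, here is the same approach wrapped in a function (the name `join_dedup` is hypothetical) and run on an empty list and a one-word list. Note that `dedup` trims the end of each word against the word that follows it, so the last word is paired with `""` and kept whole. The sketch assumes `xs` is a list.

```python
from itertools import chain, islice


def dedup(a, b):
    # Drop the longest suffix of a that is also a prefix of b.
    for i in range(len(b), 0, -1):
        if a[-i:] == b[:i]:
            return a[:-i]
    return a


def join_dedup(xs):
    # Pair every word with its successor; the last word is paired with "".
    following = chain(islice(xs, 1, None), [""])
    return "".join(dedup(a, b) for a, b in zip(xs, following))


print(join_dedup([]))                     # prints an empty line
print(join_dedup(["volcano"]))            # volcano
print(join_dedup(["split", "it", "lit"])) # splitlit
```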
