5

This might be heavily related to similar questions as Python 3.3: Split string and create all combinations , but I can't infer a pythonic solution out of this.

Question is:

Let there be a str such as 'hi|guys|whats|app', and I need all permutations of splitting that str by a separator. Example:

#splitting only once
['hi','guys|whats|app']
['hi|guys','whats|app']
['hi|guys|whats','app']
#splitting only twice
['hi','guys','whats|app']
['hi','guys|whats','app']
#splitting only three times
...
etc

I could write a backtracking algorithm, but does python (itertools, e.g.) offer a library that simplifies this algorithm?

Thanks in advance!!

glezo
  • 742
  • 2
  • 9
  • 22
  • you can do it using `.split("|", n)` where n is the max number of splits you need – Mohamed Yahya Oct 13 '21 at 12:32
  • 4
    that won't do what he wants. – Capie Oct 13 '21 at 12:32
  • 1
    @MohamedYahya yeah, but it'll also return like `['hi','guys','whats|app']` – Ghost Ops Oct 13 '21 at 12:33
  • yeah yeah, I see what he needs, thanks – Mohamed Yahya Oct 13 '21 at 12:40
  • You have `n` possible locations where you can choose to split, or not split. Find the indices of all those possible split locations. Pick `k` of them, and perform the split at those locations. For all possible strings given `k` splits, you're just finding all possible ways to choose `k` elements out of `n`. This grows pretty quickly though, so it becomes pretty intractable pretty fast. – Alexander Oct 13 '21 at 12:57
  • 1
    the list of possible splits is just the binary form of `range(2**number_of_separators)` you should be able to do the rest – diggusbickus Oct 13 '21 at 13:17
  • Why no `['hi|guys','whats','app']` when splitting twice? – no comment Oct 13 '21 at 16:54

5 Answers5

1

An approach, once you have split the string is to use itertools.combinations to define the split points in the list, the other positions should be fused again.

def lst_merge(lst, positions, sep='|'):
    '''merges a list on points other than positions'''
    '''A, B, C, D and 0, 1 -> A, B, C|D'''
    a = -1
    out = []
    for b in list(positions)+[len(lst)-1]:
        out.append('|'.join(lst[a+1:b+1]))
        a = b
    return out

def split_comb(s, split=1, sep='|'):
    from itertools import combinations
    l = s.split(sep)
    return [lst_merge(l, pos, sep=sep)
            for pos in combinations(range(len(l)-1), split)]

examples

>>> split_comb('hi|guys|whats|app', 0)
[['hi|guys|whats|app']]

>>> split_comb('hi|guys|whats|app', 1)
[['hi', 'guys|whats|app'],
 ['hi|guys', 'whats|app'],
 ['hi|guys|whats', 'app']]

>>> split_comb('hi|guys|whats|app', 2)
[['hi', 'guys', 'whats|app'],
 ['hi', 'guys|whats', 'app'],
 ['hi|guys', 'whats', 'app']]

>>> split_comb('hi|guys|whats|app', 3)
[['hi', 'guys', 'whats', 'app']]

>>> split_comb('hi|guys|whats|app', 4)
[] ## impossible

rationale

ABCD -> A B C D
         0 1 2

combinations of split points: 0/1 or 0/2 or 1/2

0/1 -> merge on 2 -> A B CD
0/2 -> merge on 1 -> A BC D
1/2 -> merge on 0 -> AB C D

generic function

Here is a generic version, working like above but also taking -1 as parameter for split, in which case it will output all combinations

def lst_merge(lst, positions, sep='|'):
    a = -1
    out = []
    for b in list(positions)+[len(lst)-1]:
        out.append('|'.join(lst[a+1:b+1]))
        a = b
    return out

def split_comb(s, split=1, sep='|'):
    from itertools import combinations, chain
    
    l = s.split(sep)
    
    if split == -1:
        pos = chain.from_iterable(combinations(range(len(l)-1), r)
                                  for r in range(len(l)+1))
    else:
        pos = combinations(range(len(l)-1), split)
        
    return [lst_merge(l, pos, sep=sep)
            for pos in pos]

example:

>>> split_comb('hi|guys|whats|app', -1)
[['hi|guys|whats|app'],
 ['hi', 'guys|whats|app'],
 ['hi|guys', 'whats|app'],
 ['hi|guys|whats', 'app'],
 ['hi', 'guys', 'whats|app'],
 ['hi', 'guys|whats', 'app'],
 ['hi|guys', 'whats', 'app'],
 ['hi', 'guys', 'whats', 'app']]
mozway
  • 194,879
  • 13
  • 39
  • 75
  • You're very welcome @glezo Out of curiosity, what is the use case for this code? – mozway Oct 13 '21 at 16:13
  • Hard one: csv delimited by pipes ("|"), having some fields pipe characters cannot be inserted into a pandas dataframe. So, i'll try to tokenize these faulty rows into all possible permutations and insert them to pandas' dataframe =D – glezo Oct 13 '21 at 16:35
0

One approach using combinations and chain

from itertools import combinations, chain


def partition(alist, indices):
    # https://stackoverflow.com/a/1198876/4001592
    pairs = zip(chain([0], indices), chain(indices, [None]))
    return (alist[i:j] for i, j in pairs)


s = 'hi|guys|whats|app'
delimiter_count = s.count("|")
splits = s.split("|")


for i in range(1, delimiter_count + 1):
    print("split", i)
    for combination in combinations(range(1, delimiter_count + 1), i):
        res = ["|".join(part) for part in partition(splits, combination)]
        print(res)

Output

split 1
['hi', 'guys|whats|app']
['hi|guys', 'whats|app']
['hi|guys|whats', 'app']
split 2
['hi', 'guys', 'whats|app']
['hi', 'guys|whats', 'app']
['hi|guys', 'whats', 'app']
split 3
['hi', 'guys', 'whats', 'app']

The idea is to generate all the ways to pick (or remove) a delimiter 1, 2, 3 times and generate the partitions from there.

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
0

Here is a recursive function I came up with:

def splitperms(string, i=0):
    if len(string) == i:
        return [[string]]
    elif string[i] == "|":
        return [*[[string[:i]] + split for split in splitperms(string[i + 1:])], *splitperms(string, i + 1)]
    else:
        return splitperms(string, i + 1)

Output:

>>> splitperms('hi|guys|whats|app')
[['hi', 'guys', 'whats', 'app'], ['hi', 'guys', 'whats|app'], ['hi', 'guys|whats', 'app'], ['hi', 'guys|whats|app'], ['hi|guys', 'whats', 'app'], ['hi|guys', 'whats|app'], ['hi|guys|whats', 'app'], ['hi|guys|whats|app']]
>>> 
U13-Forward
  • 69,221
  • 14
  • 89
  • 114
0

If you want all partitions, try partitions from more-itertools:

from more_itertools import partitions

s = 'hi|guys|whats|app'

for p in partitions(s.split('|')):
    print(list(map('|'.join, p)))

Output:

['hi|guys|whats|app']
['hi', 'guys|whats|app']
['hi|guys', 'whats|app']
['hi|guys|whats', 'app']
['hi', 'guys', 'whats|app']
['hi', 'guys|whats', 'app']
['hi|guys', 'whats', 'app']
['hi', 'guys', 'whats', 'app']

If you only want a certain number of splits, then instead of splitting at all separators and then re-joining the parts, you could just get combinations of the separator indexes and take substrings accordingly:

from itertools import combinations

s = 'hi|guys|whats|app'
splits = 2

indexes = [i for i, c in enumerate(s) if c == '|']
for I in combinations(indexes, splits):
    print([s[i+1:j] for i, j in zip([-1, *I], [*I, None])])

Output:

['hi', 'guys', 'whats|app']
['hi', 'guys|whats', 'app']
['hi|guys', 'whats', 'app']
no comment
  • 6,381
  • 4
  • 12
  • 30
0

I am surprised that most answers are using combinations, where this is clearly a binary power sequence (that is, multiple binary cartesian products concatenated).

Let me elaborate: if we have n separators, we have 2**n possible strings, where each separator is either on or off. So if we map each bit of an integer sequence from 0 to 2**n to each separator (0 means we don't split, 1 means we split) we can generate the whole thing quite efficiently (without running into stack depth limits, and being able to pause and resume the generator -or even run it in parallel!- using just a simple integer to keep track of progress).

def partition(index, tokens, separator):
    def helper():
        n = index
        for token in tokens:
            yield token
            if n % 2:
                yield separator
            n //= 2
    return ''.join(helper())

def all_partitions(txt, separator):
    tokens = txt.split(separator)
    for i in range(2**(len(tokens)-1)):
        yield partition(i, tokens, separator)

for x in all_partitions('hi|guys|whats|app', '|'):
    print(x)

Explanation:

   hi|guys|whats|app
     ^    ^     ^
bit  0    1     2 (big endian representation)
   hi guys whats up
     ^    ^     ^
0 =  0    0     0


   hi|guys whats up
     ^    ^     ^
1 =  1    0     0


   hi guys|whats up
     ^    ^     ^
2 =  0    1     0


   hi|guys|whats up
     ^    ^     ^
3 =  1    1     0


   hi guys whats|up
     ^    ^     ^
4 =  0    0     1


   hi|guys whats|up
     ^    ^     ^
5 =  1    0     1


   hi guys|whats|up
     ^    ^     ^
6 =  0    1     1


   hi|guys|whats|up
     ^    ^     ^
7 =  1    1     1
fortran
  • 74,053
  • 25
  • 135
  • 175