17

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.

Example:

>>> s = "\tthis is an  example"
>>> print(s.split())
['this', 'is', 'an', 'example']

>>> print(what_I_want(s))
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

fortran
  • 1
    +1 - Interesting question, `splitlines` seems to have a `keepends` parameter, but no such thing for `split`. Seems odd (http://docs.python.org/library/stdtypes.html#str.splitlines). – Dominic Rodger Nov 30 '09 at 15:06

5 Answers

19

How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
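
A minimal sketch of what this produces for the string from the question (the variable name `tokens` is only for illustration). Because the pattern matches either a whitespace run or a non-whitespace run, the pieces should join back into the original string:

```python
import re

s = "\tthis is an  example"

# Each match is either a run of whitespace or a run of non-whitespace.
splitter = re.compile(r'(\s+|\S+)')
tokens = splitter.findall(s)

print(tokens)
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# Joining the pieces reproduces the original string exactly.
assert ''.join(tokens) == s
```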
Jonathan Feinberg
6
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
Denis Otkidach
4

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (splitting on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.
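
To make the effect of the parentheses concrete, here is a small side-by-side sketch using the string from the question (the variable names are illustrative only):

```python
import re

text = "\tthis is an  example"

# Without a capturing group, the delimiters are discarded.
without_group = re.split(r'\s+', text)
print(without_group)
# ['', 'this', 'is', 'an', 'example']

# With a capturing group, each delimiter run is kept as its own item.
with_group = re.split(r'(\s+)', text)
print(with_group)
# ['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
```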

Edit: As pointed out, if the string starts or ends with a delimiter, the result also contains an empty string at that end. To avoid that, you can call .strip() on the input string first, although that also discards the leading and trailing whitespace.
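
If the leading whitespace has to be kept, as in the question, one alternative to .strip() is to drop the empty strings from the result. This is only a sketch, and `split_keep_whitespace` is a hypothetical helper name, not something the re module provides:

```python
import re

def split_keep_whitespace(text):
    """Hypothetical helper: split on whitespace runs and keep the runs."""
    # re.split() puts '' at an end where the text starts or ends with a
    # delimiter; filtering on truthiness removes those empty strings.
    return [piece for piece in re.split(r'(\s+)', text) if piece]

print(split_keep_whitespace("\tthis is an  example"))
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
```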

Tim Pietzcker
  • not using the OP's string masks the fact that the empty string is included as the first element of the returned list. –  Nov 30 '09 at 15:17
  • Thanks. I edited my post accordingly (although in this case, the OP's spec ("want to preserve whitespace") and his example were contradictory). – Tim Pietzcker Nov 30 '09 at 15:24
  • No, it wasn't... there was one example of the current behaviour, and another of the desired one. – fortran Dec 01 '09 at 09:00
3

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
jcdyer
-1

Thanks, guys, for pointing me to the re module. I'm still trying to decide between that and using my own function that returns a sequence...

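def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:
        return  # an empty string has no runs (and s[0] would fail)
    delim_group = s[0] in delims  # is the current run made of delimiters?
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The run type changed: emit the finished run, start a new one.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # the final run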

If I had time I'd benchmark them xD
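
For anyone who does want to compare them, a rough timeit sketch might look like the following. The sample input, the repeat count and the names `with_regex` and `with_generator` are arbitrary assumptions, and the timings will vary by machine and Python version, so no results are quoted here:

```python
import re
import timeit

# Arbitrary test input: the question's string repeated many times.
SAMPLE = "\tthis is an  example" * 200

_splitter = re.compile(r'(\s+|\S+)')

def with_regex(s):
    # The findall approach from the answers above.
    return _splitter.findall(s)

def with_generator(s, delims="\t\n\r "):
    # The hand-written generator, with a guard for empty input.
    if not s:
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]

# Both approaches should agree on this input before timing them.
assert with_regex(SAMPLE) == list(with_generator(SAMPLE))

regex_time = timeit.timeit(lambda: with_regex(SAMPLE), number=50)
generator_time = timeit.timeit(lambda: list(with_generator(SAMPLE)), number=50)
print("regex:     %.4f s" % regex_time)
print("generator: %.4f s" % generator_time)
```

Note that the two are not equivalent on every input: `\s` matches more whitespace characters (for example form feed) than the default `delims` string does.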

fortran