tokenize a string keeping delimiters in Python

Question

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.

Example:

>>> s="\tthis is an  example"
>>> print s.split()
['this', 'is', 'an', 'example']

>>> print what_I_want(s)
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

+1 - Interesting question, `splitlines` seems to have a `keepends` parameter, but no such thing for `split`. Seems odd (http://docs.python.org/library/stdtypes.html#str.splitlines). — Dominic Rodger, Nov 30 '09 at 15:06

score 19 · Accepted Answer · answered Nov 30 '09 at 15:08


How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
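
For reference, a minimal sketch of what this gives on the string from the question, written for Python 3 (`tokens` is just an illustrative name). Since the pattern matches either a whitespace run or a non-whitespace run, every character should end up in exactly one token:

```python
import re

# Match either a run of whitespace or a run of non-whitespace.
splitter = re.compile(r'(\s+|\S+)')

s = "\tthis is an  example"
tokens = splitter.findall(s)
print(tokens)
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# Every character lands in exactly one token, so joining restores the input.
print(''.join(tokens) == s)  # True
```

The round trip through `''.join` is what makes this suitable for preserving the whitespace layout after processing some of the tokens.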


Jonathan Feinberg


elegant and easily expandable (think `(\s+|\w+|\S+)`). – Nov 30 '09 at 15:16

score 6 · Answer 2 · answered Nov 30 '09 at 15:08


>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']


Denis Otkidach


Tim Pietzcker · Answer 3 · 2009-11-30T15:22:56.603

4

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

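Edit: As pointed out, if the string starts or ends with a delimiter, the result also starts or ends with an empty string next to that delimiter. To avoid that you can call .strip() on the input string first, at the cost of dropping the leading and trailing whitespace itself.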
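
A small sketch of that behaviour on the string from the question, written for Python 3. The names `parts` and `tokens` are only illustrative, and filtering out empty strings is one possible alternative to `.strip()`, not something the documentation prescribes:

```python
import re

s = "\tthis is an  example"

# The capturing group makes re.split return the separators as well.
parts = re.split(r'(\s+)', s)
print(parts)
# ['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# Dropping empty strings keeps the leading tab, unlike calling strip() first.
tokens = [p for p in parts if p]
print(tokens)
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
```

Filtering rather than stripping keeps the leading `'\t'`, which matters if the whitespace layout has to be reproduced exactly.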

edited Nov 30 '09 at 15:22

answered Nov 30 '09 at 15:09

Tim Pietzcker


not using the OP's string masks the fact that the empty string is included as the first element of the returned list. – Nov 30 '09 at 15:17
Thanks. I edited my post accordingly (although in this case, the OP's spec ("want to preserve whitespace") and his example were contradictory). – Tim Pietzcker Nov 30 '09 at 15:24
No, it wasn't... there was one example of the current behaviour, and another of the desired one. – fortran Dec 01 '09 at 09:00

jcdyer · Answer 4 · 2009-11-30T17:03:30.760

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})

score -1 · Answer 5 · answered Nov 30 '09 at 15:28

Thanks guys for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...

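def split_keep_delimiters(s, delims="\t\n\r "):
    # Yield alternating runs of delimiter and non-delimiter characters.
    if not s:
        return  # an empty string has no runs (s[0] would raise IndexError)
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The run type changed: emit the finished run, start a new one.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]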

If I had time I'd benchmark them xD
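
A rough sketch of how such a benchmark could look with the standard `timeit` module, written for Python 3. The helper names `with_regex` and `with_generator`, the sample string, and the repeat count are all arbitrary choices for illustration, and the generator here is a copy of the one above with a guard for empty input. No timings are claimed; they will vary by machine and Python version:

```python
import re
import timeit

def split_keep_delimiters(s, delims="\t\n\r "):
    # Yield alternating runs of delimiter and non-delimiter characters.
    if not s:
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]

_splitter = re.compile(r'(\s+|\S+)')

def with_regex(s):
    return _splitter.findall(s)

def with_generator(s):
    return list(split_keep_delimiters(s))

# Arbitrary test input: the question's string repeated a few hundred times.
sample = "\tthis is an  example" * 200

# Check that both approaches agree before timing them.
assert with_regex(sample) == with_generator(sample)

for func in (with_regex, with_generator):
    seconds = timeit.timeit(lambda: func(sample), number=50)
    print('%s: %.4f s for 50 runs' % (func.__name__, seconds))
```

Note that the two are only equivalent when the input's whitespace is limited to the characters in `delims`; `\s` also matches other whitespace such as form feeds.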

No need for regex or reinventing the wheel if you have Python 2.5 onwards; see my answer. — ghostdog74, Dec 01 '09 at 00:09
