How do I do what strtok() does in C, in Python?

Question

I am learning Python and trying to figure out an efficient way to tokenize a string of comma-separated numbers into a list. Well-formed cases work as I expect, but less well-formed ones do not.

If I have this:

A = '1,2,3,4'
B = [int(x) for x in A.split(',')]

B results in [1, 2, 3, 4]

which is what I expect, but if the string is something more like

A = '1,,2,3,4,'

If I use the same list comprehension for B as above, I get an exception. I think I understand why (some of the "x" string values are not integers), but I suspect there is a way to parse this elegantly, so that tokenizing the string A works more like calling strtok(A,",\n\t") iteratively in C.

To be clear about what I am asking: I am looking for an elegant/efficient/typical way in Python to have all of the following example strings:

A='1,,2,3,\n,4,\n'
A='1,2,3,4'
A=',1,2,3,4,\t\n'
A='\n\t,1,2,3,,4\n'

produce the same list:

B=[1,2,3,4]

via some sort of compact expression.

score 30 · Accepted Answer · answered Jan 18 '09 at 23:09

How about this:

A = '1, 2,,3,4  '
B = [int(x) for x in A.split(',') if x.strip()]

x.strip() trims whitespace from the string, which will make it empty if the string is all whitespace. An empty string is "false" in a boolean context, so it's filtered by the if part of the list comprehension.
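
As a quick check, the sketch below runs that comprehension against the four strings from the question. The helper name parse_ints is made up for this illustration and is not part of the original answer.

```python
def parse_ints(text):
    """Split on commas, drop empty or whitespace-only pieces, convert the rest."""
    return [int(x) for x in text.split(',') if x.strip()]

# The four inputs from the question.
samples = ['1,,2,3,\n,4,\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
for s in samples:
    print(parse_ints(s))  # each line prints [1, 2, 3, 4]
```

Note that whitespace only counts as padding here, not as a delimiter: a piece such as '1\n2' would still raise ValueError.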

Dave Ray

strip() is overkill here, converting to int already takes care of the whitespace... better filter out the invalid substrings. – Algorias Jan 19 '09 at 02:00
strip() is necessary, or you will get a ValueError: invalid literal for int() with base 10: ' ' – Ryan Chen Apr 28 '21 at 14:12
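
The two comments above can be reconciled with a small experiment: int() ignores whitespace around digits, but it rejects a string that contains only whitespace. The variable names below are illustrative.

```python
# Surrounding whitespace is ignored when there are digits to parse.
padded = int(' 4\n')

# A whitespace-only string has no digits, so int() raises ValueError.
try:
    int(' ')
    blank_ok = True
except ValueError:
    blank_ok = False

print(padded, blank_ok)  # 4 False
```

So strip() is not needed for the conversion itself, only for deciding which pieces to skip.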

score 4 · Answer 2 · answered Oct 16 '16 at 14:55

For the sake of completeness, I will answer this seven-year-old question. The C program that uses strtok:

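#include <stdio.h>
#include <string.h>

int main(void)
{
    char myLine[] = "This is;a-line,with pieces";
    char *p;
    /* The first call passes the string; later calls pass NULL to continue. */
    for (p = strtok(myLine, " ;-,"); p != NULL; p = strtok(NULL, " ;-,"))
    {
        printf("piece=%s\n", p);
    }
    return 0;
}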

can be accomplished in Python with re.split as:

import re
myLine = "This is;a-line,with pieces"
# Raw string, so that \- reaches the regex engine as an escaped hyphen.
for p in re.split(r"[ ;\-,]", myLine):
    print("piece=" + p)
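
One difference worth keeping in mind: strtok() treats a run of delimiters as a single separator and never returns an empty token, while re.split() with a single-character class yields an empty string between adjacent delimiters. A minimal sketch of behavior closer to strtok(), assuming a hypothetical helper named strtok_like and a non-empty delimiter string:

```python
import re

def strtok_like(text, delims):
    """Return the non-empty pieces of text separated by any character in delims.

    Hypothetical helper for illustration; assumes delims is non-empty.
    """
    pattern = '[' + re.escape(delims) + ']+'   # one or more delimiter characters
    # A leading or trailing delimiter still produces an empty piece, so drop those.
    return [tok for tok in re.split(pattern, text) if tok]

print(strtok_like("This is;a-line,with pieces", " ;-,"))
# ['This', 'is', 'a', 'line', 'with', 'pieces']
print([int(t) for t in strtok_like('\n\t,1,2,3,,4\n', ',\n\t')])
# [1, 2, 3, 4]
```

Unlike strtok(), this returns all tokens at once and leaves the input string unmodified.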

score 4 · Answer 3 · answered Jan 19 '09 at 01:54

Generally, I try to avoid regular expressions, but if you want to split on a bunch of different things, they work. Try this:

import re
A = '1,,2,3,\n,4,\n'  # example input from the question
result = [int(x) for x in filter(None, re.split('[,\n\t]', A))]

Nick

Alec Thomas · Answer 4 · 2009-01-19T02:59:58.680

4

Mmm, functional goodness (with a bit of generator expression thrown in):

a = "1,2,,3,4,"
print(list(map(int, filter(None, (i.strip() for i in a.split(','))))))

For full functional joy (Python 2 only; string.strip() was removed in Python 3):

import string
a = "1,2,,3,4,"
print map(int, filter(None, map(string.strip, a.split(','))))

edited Jan 19 '09 at 02:59

answered Jan 19 '09 at 02:54

`print map(int, filter(len, map(str.strip, a.split(','))))` Note: str.strip(i) and i.strip() are the same (no need in `string` module). `len` is used for readability. – jfs Jan 19 '09 at 19:35
I prefer `string.strip()` as it safely deals with Unicode strings. – Alec Thomas Jan 20 '09 at 01:58
@Alec: `map(type(a).strip, a.split(','))` will work both for Unicode and encoded strings. – jfs Jan 20 '09 at 04:16
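
For readers on Python 3, here is a sketch of the same functional pipeline using str.strip, as suggested in the comments. In Python 3 every str is Unicode, so the str versus unicode concern discussed above no longer applies.

```python
a = "1,2,,3,4,"

# In Python 3, map() and filter() return lazy iterators, so wrap the result in list().
# str.strip is passed unbound; map() calls it on each piece.
result = list(map(int, filter(None, map(str.strip, a.split(',')))))
print(result)  # [1, 2, 3, 4]
```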

score 1 · Answer 5 · answered Jan 18 '09 at 23:41

This will work, and never raise an exception, if all the numbers are ints. The isdigit() call is false if there's a decimal point in the string.

>>> nums = ['1,,2,3,\n,4\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
>>> for n in nums:
...     [ int(i.strip()) for i in n.split(',') if i.strip() and i.strip().isdigit() ]
... 
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]

runeh

The isdigit check is not necessary for the test cases provided, but it does add extra robustness. – Carl Meyer Jan 19 '09 at 00:50
The first i.strip() is redundant. – Ryan Ginstrom Jan 19 '09 at 02:48
You're performing i.strip() three times per element. Yikes. – Kenan Banks Jan 19 '09 at 03:33

Algorias · Answer 6 · 2009-01-19T01:55:37.000

1

How about this?

>>> a = "1,2,,3,4,"
>>> list(map(int, filter(None, a.split(","))))
[1, 2, 3, 4]

filter will remove all false values (i.e. empty strings), which are then mapped to int.

EDIT: Just tested this against the versions posted above, and it seems to be significantly faster: about 15% faster than the strip() one and more than twice as fast as the isdigit() one.
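
The measurements themselves are not shown in the answer. As a rough sketch of how such a comparison could be reproduced with the standard timeit module (the function names and the iteration count are arbitrary choices for illustration, and results will vary by machine and Python version):

```python
import timeit

a = "1,2,,3,4,"

def with_filter():
    # filter(None, ...) drops empty strings only.
    return list(map(int, filter(None, a.split(','))))

def with_strip():
    # The strip()-based comprehension from the accepted answer.
    return [int(x) for x in a.split(',') if x.strip()]

for func in (with_filter, with_strip):
    seconds = timeit.timeit(func, number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s for 100,000 calls")
```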

edited Jan 19 '09 at 01:55

answered Jan 19 '09 at 01:49

he needs it to filter whitespace – hasen Jan 19 '09 at 01:54
Yes, this will filter out whitespace: >>> int(" 1 ") ==> 1 – Algorias Jan 19 '09 at 01:56
This fails if any of the empty entries has whitespace, e.g. '\n\t,1,2,3,,4\n' – Dave Ray Jan 19 '09 at 02:25
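
The failure described in the last comment is easy to reproduce. In this sketch (the function name is illustrative), filter(None, ...) keeps the whitespace-only piece '\n\t' because it is a non-empty string, and int() then rejects it:

```python
def parse_filter_only(text):
    # filter(None, ...) drops empty strings but keeps whitespace-only ones.
    return list(map(int, filter(None, text.split(','))))

print(parse_filter_only("1,2,,3,4,"))        # [1, 2, 3, 4]

try:
    parse_filter_only('\n\t,1,2,3,,4\n')     # the first piece is '\n\t'
except ValueError as exc:
    print('ValueError:', exc)
```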

joeforker · Answer 7 · 2009-01-27T13:44:19.020

1

Why accept inferior substitutes that cannot segfault your interpreter? With ctypes you can just call the real thing! :-)

# strtok in Python
from ctypes import c_char_p, cdll, create_string_buffer

try:
    libc = cdll.LoadLibrary('libc.so.6')     # Linux (glibc)
except OSError:
    libc = cdll.LoadLibrary('msvcrt.dll')    # Windows

libc.strtok.restype = c_char_p
# strtok writes NUL bytes into its input, so use a mutable buffer.
dat = create_string_buffer(b"1,,2,3,4")
sep = b",\n\t"
# Tokens come back as bytes objects in Python 3.
result = [libc.strtok(dat, sep)] + list(iter(lambda: libc.strtok(None, sep), None))
print(result)

edited Jan 27 '09 at 13:44

answered Jan 19 '09 at 16:43

added `cdll.LoadLibrary('msvcrt.dll')` – jfs Jan 19 '09 at 20:05
When I saw the `while` loop I'd thought of `lambda tokenize dat, sep: itertools.chain((strtok(dat, sep),), iter(lambda: strtok(None, sep), None))`. It is funny how similar programmers' minds work. Add the smile emoticon back otherwise somebody can think that the code is not a joke. – jfs Jan 21 '09 at 19:44

score 0 · Answer 8 · answered Jan 19 '09 at 02:50

Why not just wrap in a try except block which catches anything not an integer?
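
A minimal sketch of that idea, assuming a hypothetical helper named parse_ints_lenient that splits on a single separator and skips any piece int() rejects:

```python
def parse_ints_lenient(text, sep=','):
    """Convert each piece with int(), skipping pieces that int() rejects."""
    numbers = []
    for piece in text.split(sep):
        try:
            numbers.append(int(piece))
        except ValueError:
            # Empty, whitespace-only, or otherwise non-integer piece: skip it.
            continue
    return numbers

print(parse_ints_lenient('1,,2,3,\n,4,\n'))   # [1, 2, 3, 4]
print(parse_ints_lenient('1,x,2.5,3'))        # [1, 3]
```

The trade-off is that malformed input such as 'x' or '2.5' is dropped silently, which may or may not be what you want.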

Josh Smeaton

score 0 · Answer 9 · answered Mar 12 '20 at 17:47

I was desperately in need of a strtok equivalent in Python, so I wrote a simple one of my own:

def strtok(val, delim):
    """Split val on every character in delim, dropping blank tokens."""
    token_list = [val]
    for key in delim:
        # Re-split every token found so far on the next delimiter character.
        nList = []
        for token in token_list:
            nList += [x for x in token.split(key) if x.strip()]
        token_list = nList
    return token_list

score -1 · Answer 10 · answered Jan 18 '09 at 23:12

I'd guess regular expressions are the way to go: http://docs.python.org/library/re.html
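
One way this could look, as a sketch rather than the answerer's own code: instead of splitting on delimiters, use re.findall to pull out the integers directly.

```python
import re

A = '\n\t,1,2,3,,4\n'
# -?\d+ matches an optional minus sign followed by one or more digits.
B = [int(m) for m in re.findall(r'-?\d+', A)]
print(B)  # [1, 2, 3, 4]
```

This ignores everything that is not part of a digit run, so input such as '2.5' would yield 2 and 5 rather than an error.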

Simon Groenewolt
