9

I am learning Python and trying to figure out an efficient way to tokenize a string of numbers separated by commas into a list. Well-formed cases work as I expect, but less well-formed cases do not.

If I have this:

A = '1,2,3,4'
B = [int(x) for x in A.split(',')]

B results in [1, 2, 3, 4]

which is what I expect, but if the string is something more like

A = '1,,2,3,4,'

and I use the same list comprehension expression for B as above, I get an exception. I think I understand why (some of the "x" string values are not integers), but I suspect there is an elegant way to parse this so that tokenizing the string A works more like strtok(A,",\n\t") would have when called iteratively in C.

To be clear about what I am asking: I am looking for an elegant/efficient/typical way in Python to have all of the following example strings:

A='1,,2,3,\n,4,\n'
A='1,2,3,4'
A=',1,2,3,4,\t\n'
A='\n\t,1,2,3,,4\n'

return with the same list of:

B=[1,2,3,4]

via some sort of compact expression.

Rex M
Tall Jeff

10 Answers

30

How about this:

A = '1, 2,,3,4  '
B = [int(x) for x in A.split(',') if x.strip()]

x.strip() trims whitespace from the string, which will make it empty if the string is all whitespace. An empty string is "false" in a boolean context, so it's filtered by the if part of the list comprehension.
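
As a quick check, the comprehension above can be run against the four strings from the question. This is only a sketch; the helper name `parse_ints` is made up for illustration and is not part of the answer.

```python
def parse_ints(text):
    """Split on commas and keep only the pieces that are not blank."""
    # int() tolerates surrounding whitespace, so only blank pieces need filtering
    return [int(x) for x in text.split(',') if x.strip()]

cases = ['1,,2,3,\n,4,\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
for case in cases:
    print(parse_ints(case))  # each line prints [1, 2, 3, 4]
```

Because the newlines and tabs in these strings always sit next to a comma or a digit, splitting on the comma alone is enough here; a string such as '1\n2' would still fail, since int('1\n2') raises ValueError.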

Dave Ray
  • strip() is overkill here, converting to int already takes care of the whitespace... better filter out the invalid substrings. – Algorias Jan 19 '09 at 02:00
  • strip() is necessary, or you will get a ValueError: invalid literal for int() with base 10: ' ' – Ryan Chen Apr 28 '21 at 14:12
4

For the sake of completeness, I will answer this seven-year-old question. The C program that uses strtok:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char myLine[] = "This is;a-line,with pieces";
    char *p;
    /* strtok returns NULL once no tokens remain */
    for (p = strtok(myLine, " ;-,"); p != NULL; p = strtok(NULL, " ;-,"))
    {
        printf("piece=%s\n", p);
    }
    return 0;
}

can be accomplished in Python with re.split as:

import re
myLine = "This is;a-line,with pieces"
# Raw string so that \- reaches the regex engine unchanged
for p in re.split(r"[ ;\-,]", myLine):
    print("piece="+p)
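
One difference from strtok is worth noting: strtok treats a run of delimiters as a single separator and never returns an empty token, while re.split with a single-character class returns an empty string between adjacent delimiters. A minimal sketch of the difference, using a made-up input with doubled delimiters:

```python
import re

line = "This is;;a-line,,with pieces"

# A single-character class yields empty strings between adjacent delimiters
print(re.split(r"[ ;\-,]", line))
# ['This', 'is', '', 'a', 'line', '', 'with', 'pieces']

# Adding + collapses runs of delimiters; the filter drops an empty
# piece that would otherwise appear at the very start or end
tokens = [p for p in re.split(r"[ ;\-,]+", line) if p]
print(tokens)
# ['This', 'is', 'a', 'line', 'with', 'pieces']
```
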
user1683793
4

Generally, I try to avoid regular expressions, but if you want to split on a bunch of different things, they work. Try this:

import re
A = '1,,2,3,\n,4,\n'
# filter(None, ...) drops the empty strings produced by adjacent delimiters
result = [int(x) for x in filter(None, re.split(r'[,\n\t]', A))]
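
A different regex approach, offered here only as a sketch and not taken from the answer: instead of splitting on delimiters, `re.findall` can pull out the digit runs directly, which also ignores stray spaces. It assumes the numbers are non-negative integers, and the name `extract_ints` is hypothetical.

```python
import re

def extract_ints(text):
    # \d+ matches each run of digits; everything else acts as a separator
    return [int(m) for m in re.findall(r'\d+', text)]

print(extract_ints('1,,2,3,\n,4,\n'))    # [1, 2, 3, 4]
print(extract_ints(' 10, 20 ,,30\t'))    # [10, 20, 30]
```

The trade-off is that malformed input is never reported: '1.5' would come back as [1, 5] rather than raising an error.
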
Nick
4

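Mmm, functional goodness (with a bit of generator expression thrown in):

a = "1,2,,3,4,"
# In Python 3, map() is lazy, so wrap it in list() before printing
print(list(map(int, filter(None, (i.strip() for i in a.split(','))))))

For full functional joy:

a = "1,2,,3,4,"
# str.strip stands in for Python 2's string.strip, which Python 3 does not provide
print(list(map(int, filter(None, map(str.strip, a.split(','))))))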
Alec Thomas
  • `print map(int, filter(len, map(str.strip, a.split(','))))` Note: str.strip(i) and i.strip() are the same (no need for the `string` module). `len` is used for readability. – jfs Jan 19 '09 at 19:35
  • I prefer `string.strip()` as it safely deals with Unicode strings. – Alec Thomas Jan 20 '09 at 01:58
  • @Alec: `map(type(a).strip, a.split(','))` will work both for Unicode and encoded strings. – jfs Jan 20 '09 at 04:16
1

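This will work, and never raise an exception, as long as all the numbers are non-negative integers. The isdigit() call is false for any token that contains a decimal point or a minus sign, so such tokens are skipped rather than passed to int().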

>>> nums = ['1,,2,3,\n,4\n', '1,2,3,4', ',1,2,3,4,\t\n', '\n\t,1,2,3,,4\n']
>>> for n in nums:
...     [int(i.strip()) for i in n.split(',') if i.strip().isdigit()]
... 
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
[1, 2, 3, 4]
runeh
1

How about this?

>>> a = "1,2,,3,4,"
>>> list(map(int, filter(None, a.split(","))))
[1, 2, 3, 4]

filter will remove all false values (i.e. empty strings), which are then mapped to int.

EDIT: Just tested this against the above posted versions, and it seems to be significantly faster, 15% or so compared to the strip() one and more than twice as fast as the isdigit() one
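
A comparison like this can be reproduced with the standard `timeit` module. The sketch below is not the measurement from this answer: the three function names are made up, the isdigit() variant is one reading of the other answer, and the numbers will vary with interpreter version and machine.

```python
import timeit

a = "1,2,,3,4,"

def with_filter():
    # filter(None, ...) drops empty strings before int() sees them
    return list(map(int, filter(None, a.split(","))))

def with_strip():
    return [int(x) for x in a.split(",") if x.strip()]

def with_isdigit():
    return [int(x) for x in a.split(",") if x.strip().isdigit()]

# Time each variant over the same number of calls and print the totals
for fn in (with_filter, with_strip, with_isdigit):
    seconds = timeit.timeit(fn, number=100_000)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

Note that the filter(None, ...) version only removes empty strings; a whitespace-only piece such as ' ' or '\n' still reaches int() and raises ValueError.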

Algorias
1

Why accept inferior substitutes that cannot segfault your interpreter? With ctypes you can just call the real thing! :-)

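# strtok in Python
from ctypes import c_char_p, cdll, create_string_buffer

try:
    libc = cdll.LoadLibrary('libc.so.6')
except OSError:
    libc = cdll.LoadLibrary('msvcrt.dll')

libc.strtok.restype = c_char_p
# strtok writes NUL bytes into its input, so pass a mutable buffer, not a bytes literal
dat = create_string_buffer(b"1,,2,3,4")
sep = c_char_p(b",\n\t")
# iter(callable, sentinel) keeps calling strtok until it returns None (NULL)
result = [libc.strtok(dat, sep)] + list(iter(lambda: libc.strtok(None, sep), None))
print(result)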
joeforker
  • added `cdll.LoadLibrary('msvcrt.dll')` – jfs Jan 19 '09 at 20:05
  • When I saw the `while` loop I'd thought of `lambda tokenize dat, sep: itertools.chain((strtok(dat, sep),), iter(lambda: strtok(None, sep), None))`. It is funny how similar programmers' minds work. Add the smile emoticon back otherwise somebody can think that the code is not a joke. – jfs Jan 21 '09 at 19:44
0

Why not just wrap in a try except block which catches anything not an integer?
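
A minimal sketch of that idea, assuming the separators from the question (comma, newline, tab); the function name `parse_ints_lenient` and its `seps` parameter are made up for illustration.

```python
def parse_ints_lenient(text, seps=",\n\t"):
    """Return the integers in text, skipping any piece int() rejects."""
    # Normalize every separator to a comma so a single split() is enough
    for sep in seps:
        text = text.replace(sep, ",")
    result = []
    for piece in text.split(","):
        try:
            result.append(int(piece))
        except ValueError:
            # Empty or non-numeric piece: ignore it
            continue
    return result

print(parse_ints_lenient('\n\t,1,2,3,,4\n'))  # [1, 2, 3, 4]
```

The catch is that every bad piece is dropped silently, so genuinely malformed input such as '1.5' or 'abc' goes unnoticed.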

Josh Smeaton
0

I was desperately in need of a strtok equivalent in Python, so I wrote a simple one of my own:

def strtok(val, delim):
    # Split val on each delimiter character in turn, dropping blank pieces
    token_list = [val]
    for key in delim:
        nList = []
        for token in token_list:
            nList += [x for x in token.split(key) if x.strip()]
        token_list = nList
    return token_list
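
For comparison, here is a single-pass variant written as a generator, which is closer to calling strtok repeatedly: it scans the string once and yields each token as it is completed. This is only a sketch, and the name `iter_tokens` is hypothetical.

```python
def iter_tokens(text, delims):
    """Yield the non-empty runs of characters between any of the delimiters."""
    token = []
    for ch in text:
        if ch in delims:
            # A delimiter ends the current token, if one has been started
            if token:
                yield "".join(token)
                token = []
        else:
            token.append(ch)
    # Emit a final token that was not followed by a delimiter
    if token:
        yield "".join(token)

print(list(iter_tokens("This is;a-line,with pieces", " ;-,")))
# ['This', 'is', 'a', 'line', 'with', 'pieces']
print([int(t) for t in iter_tokens("1,,2,3,\n,4,\n", ",\n\t")])
# [1, 2, 3, 4]
```
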
-1

I'd guess regular expressions are the way to go: http://docs.python.org/library/re.html

Simon Groenewolt