Filtering out strings that contain only digits and/or punctuation - python

Question

I need to filter out only the strings that contain nothing but digits and/or a fixed set of punctuation.

I've tried checking each character and summing the Boolean results to see whether the sum equals len(str). Is there a more Pythonic way to do this?

>>> import string
>>> x = ['12,523', '3.46', "this is not", "foo bar 42", "23fa"]
>>> [i for i in x if [True if j.isdigit() else False for j in i] ]
['12,523', '3.46', 'this is not', 'foo bar 42']
>>> [i for i in x if sum([True if j.isdigit() or j in string.punctuation else False for j in i]) == len(i)]
['12,523', '3.46']

Are you sure you don't **really** mean "I need to find strings that could represent numbers, but `float` etc. doesn't work because I also want to allow for commas as thousands separators"? — Karl Knechtel, Feb 11 '14 at 08:45
yes i will need that later but a two layer filtering will also capture numeric indexing in legal documents (e.g. `x = ["chapter", "1.2.3.5"]`) — alvas, Feb 11 '14 at 08:49

falsetru · Accepted Answer · 2014-02-11T10:57:40.837

Using all with a generator expression, you don't need to count matches and compare against the length:

>>> [i for i in x if all(j.isdigit() or j in string.punctuation for j in i)]
['12,523', '3.46']

BTW, the code above and the OP's code will also include strings that contain only punctuation.

>>> x = [',,,', '...', '123', 'not number']
>>> [i for i in x if all(j.isdigit() or j in string.punctuation for j in i)]
[',,,', '...', '123']

To handle that, add one more condition that requires at least one digit:

>>> [i for i in x if all(j.isdigit() or j in string.punctuation for j in i) and any(j.isdigit() for j in i)]
['123']

You can make it a little faster by storing string.punctuation in a set, since membership tests on a set take constant time on average.

>>> puncs = set(string.punctuation)
>>> [i for i in x if all(j.isdigit() or j in puncs for j in i) and any(j.isdigit() for j in i)]
['123']
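
Since the question asks about a fixed set of punctuation rather than everything in string.punctuation, the same check can be wrapped in a small helper. This is only a sketch: the function name `is_numeric_like` and the default `allowed` set of "," and "." are assumptions made for illustration, not part of the original answer.

```python
def is_numeric_like(s, allowed=frozenset(",.")):
    """Return True if s has only digits/allowed punctuation and at least one digit.

    `allowed` is a hypothetical default; pass whatever punctuation set you need.
    """
    return (
        all(ch.isdigit() or ch in allowed for ch in s)
        and any(ch.isdigit() for ch in s)
    )


x = ['12,523', '3.46', '1.2.3.5', ',,,', 'this is not', 'foo bar 42', '23fa', '']
print([s for s in x if is_numeric_like(s)])
# ['12,523', '3.46', '1.2.3.5']
```

The empty string is rejected because any() over an empty sequence is False, and ',,,' is rejected for the same reason: it contains no digit.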

edited Feb 11 '14 at 10:57

answered Feb 11 '14 at 08:31

You could maybe make it a little bit faster by storing the result of `string.punctuation` in a `set`. – Frerich Raabe Feb 11 '14 at 10:51

@FrerichRaabe, thank you for the comment. I added your suggestion. – falsetru Feb 11 '14 at 10:57

thefourtheye · Answer 2 · 2014-02-11T10:37:43.260

You can use a pre-compiled regular expression to check this.

import re, string
# Character class: digits plus every (escaped) punctuation character.
pattern = re.compile(r"[\d{}]+$".format(re.escape(string.punctuation)))
x = ['12,523', '3.46', "this is not", "foo bar 42", "23fa"]
print([item for item in x if pattern.match(item)])

Output

['12,523', '3.46']
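
Two caveats apply to the pattern above: `$` also matches just before a trailing newline, so '123\n' would pass, and punctuation-only strings such as ',,,' pass as well. A possible stricter variant, sketched here under the assumption of Python 3.4+ (for `fullmatch`), adds a lookahead that requires at least one digit. The names `char_class` and `strict_pattern` are illustrative, not from the original answer.

```python
import re
import string

# Character class body: digits plus every (escaped) punctuation character.
char_class = r"\d" + re.escape(string.punctuation)

# (?=.*\d) is a lookahead requiring at least one digit somewhere in the string.
strict_pattern = re.compile(r"(?=.*\d)[" + char_class + r"]+")

x = ['12,523', '3.46', ',,,', '123\n', 'this is not', 'foo bar 42', '23fa']
# fullmatch only succeeds if the whole string, newline included, fits the pattern.
print([item for item in x if strict_pattern.fullmatch(item)])
# ['12,523', '3.46']
```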

A quick timing comparison between @falsetru's solution and mine:

import re, string
punct = string.punctuation
pattern = re.compile(r"[\d{}]+$".format(re.escape(string.punctuation)))
x = ['12,523', '3.46', "this is not", "foo bar 42", "23fa"]

from timeit import timeit
# Each statement runs timeit's default of 1,000,000 times; results are in seconds.
print(timeit("[item for item in x if pattern.match(item)]", "from __main__ import pattern, x"))
print(timeit("[i for i in x if all(j.isdigit() or j in punct for j in i)]", "from __main__ import x, punct"))

Output on my machine

2.03506183624
4.28856396675

So, on this input, the pre-compiled regex approach is roughly twice as fast as the all-based approach (the any check was not included in the timing).
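
Figures like these depend on the input, the interpreter version, and the machine, so it may be worth re-running the comparison locally. The following is a minimal sketch for Python 3 that also uses the set-based lookup; the function names `with_regex` and `with_all` are made up for this example, and no particular result is implied.

```python
import re
import string
from timeit import timeit

x = ['12,523', '3.46', "this is not", "foo bar 42", "23fa"]
puncs = set(string.punctuation)
pattern = re.compile(r"[\d{}]+$".format(re.escape(string.punctuation)))


def with_regex():
    return [item for item in x if pattern.match(item)]


def with_all():
    return [i for i in x if all(j.isdigit() or j in puncs for j in i)]


# Both variants should agree before their speed is compared.
assert with_regex() == with_all()

for func in (with_regex, with_all):
    # number is kept small so the sketch finishes quickly; raise it for steadier figures.
    print(func.__name__, timeit(func, number=100000))
```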
