I want to check the percentage of numeric content in a particular string. For example,

Words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']

For input like that, the output should be:

output 
0.5
0.67
0.5
0.625
0.0

The quantification behind the output: for 'p2', the number of numeric characters is 1 and the total length is 2, giving 0.5. Likewise, I want to find the output for all the entries.

I have tried the following:

float(sum(c.isdigit() for c in Words[i])) / float(len(Words[i]))

This works fine, but it is very inefficient, and when I run it with PySpark I get errors such as JVM errors. I am looking for an efficient way to compute this so that I can run it at scale on a dataset of ~2 billion records.
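For context, a minimal sketch of how this per-string calculation can be expressed as a PySpark UDF (the DataFrame `df` and the column name `word` are illustrative placeholders):

# Sketch only: `df` and the column name "word" are assumptions
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def digit_fraction(s):
    # Fraction of characters in s that are digits; guard against null/empty
    return sum(c.isdigit() for c in s) / float(len(s)) if s else 0.0

digit_fraction_udf = udf(digit_fraction, DoubleType())
df = df.withColumn("digit_fraction", digit_fraction_udf("word"))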

Any help would be appreciated.

Thanks

– haimen

6 Answers

This worked for me. You can use a regular expression in Python: just `import re`. Because `re` is implemented in C, it is very fast.

import re

for i in Words:
    print float(len(''.join(re.findall(r'\d', i)))) / float(len(i))

With `re.findall(r'\d', i)` you find all the digits in each of your list's elements, and `len()` gives you their count. Judging by the results, if you have 1000 words of length ~100 or more, regex seems the best way for you.
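As a side note, compiling the pattern once outside the loop avoids repeated pattern-cache lookups, and counting the matches directly skips the intermediate join; a sketch:

import re

digit = re.compile(r'\d')

for i in Words:
    # digit.findall(i) returns the list of digit characters in i
    print float(len(digit.findall(i))) / len(i)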

– keyvan vafaee
  • regex is going to be _significantly_ slower than the simple string processing the OP is already doing – Hamms Aug 15 '17 at 21:24
  • @Hamms, tossing the needless `float()` stuff, I found that with short alphanumeric strings of 10 characters, the OP's code beats this regex-based solution. But when the strings grew to 100 characters, the regex-based solution significantly beats the OP's solution. I attribute the difference to iterating in Python vs. iterating in C once the strings are long enough to discount regex overhead. Did you get different timing results? – cdlane Aug 15 '17 at 22:36
  • some testing for me shows that regex is faster than OP's method for strings ~30 characters or longer, and faster than other, better list comprehensions for strings ~40 characters or longer, but significantly slower (on the order of twice as slow) for shorter strings. So the real answer is going to depend on the nature of OP's data – Hamms Aug 15 '17 at 23:23
  • So if the strings get longer, regex is the best way, yes? – keyvan vafaee Aug 15 '17 at 23:30
  • Actually after looking at approaches using `map`, `filter`, and `str.isdigit`, something like `map(lambda word: len(filter(str.isdigit, word))/float(len(word)), words)` is slightly better than regex for strings of ~100 characters and significantly better for shorter strings – Hamms Aug 15 '17 at 23:32

"Inefficient" is something you test for, not guess at. I ran several variations on this (isdigit(), re.sub(), etc.) and only 2 things were faster than your code: getting rid of the unnecessary float(), and not using the i index.

For example:

import timeit

words = ['p2', 'p23','pp34','ppp01932','boss']

def isdigsub():
    for i in range(len(words)):
        float(sum(c.isdigit() for c in words[i])) / float(len(words[i]))

def isdigsub2():
    for i in range(len(words)):
        sum(c.isdigit() for c in words[i]) / len(words[i])

def isdigsub3():
    for w in words:
        sum(c.isdigit() for c in w) / len(w)

def isdigsub4():
    # From user Hamms
    for w in words:
        len([c for c in w if c.isdigit()]) / len(w)

if __name__ == '__main__':

    print(timeit.timeit('isdigsub()', setup="from __main__ import isdigsub", number=10000))
    print(timeit.timeit('isdigsub2()', setup="from __main__ import isdigsub2", number=10000))
    print(timeit.timeit('isdigsub3()', setup="from __main__ import isdigsub3", number=10000))
    print(timeit.timeit('isdigsub4()', setup="from __main__ import isdigsub4", number=10000))

On a pokey old Cubox, this produced:

0.7179876668378711
0.5230729999020696
0.4444526666775346
0.3233160013332963

Aaaand Hamms is in the lead with the best time so far. Barkeep! List comprehensions for everyone!

– Peter Rowell

So many interesting approaches proposed here, and based on some fiddling around it looks like the relative times of each can fluctuate quite a bit based on the lengths of the words being considered.

Let's grab some of the proposed solutions to test:

import re


def original(words):
    [sum(c.isdigit() for c in word) / float(len(word)) for word in words]


def filtered_list_comprehension(words):
    [len([c for c in word if c.isdigit()]) / len(word) for word in words]


def regex(words):
    [len("".join(re.findall(r"\d", word))) / float(len(word)) for word in words]


def native_filter(words):
    # Python 2: filter() returns a list, so len() works directly
    [len(filter(str.isdigit, word)) / float(len(word)) for word in words]


def native_filter_with_map(words):
    map(lambda word: len(filter(str.isdigit, word)) / float(len(word)), words)

And test them each with varying word lengths. Times are in seconds. Testing with 1000 words of length 10:

                    original:       1.976
 filtered_list_comprehension:       1.224
                       regex:       2.575
               native_filter:       1.209
      native_filter_with_map:       1.264

Testing with 1000 words of length 20:

                    original:       3.044
 filtered_list_comprehension:       2.032
                       regex:       3.205
               native_filter:       1.947
      native_filter_with_map:       2.034

Testing with 1000 words of length 30:

                    original:       4.115
 filtered_list_comprehension:       2.819
                       regex:       3.889
               native_filter:       2.708
      native_filter_with_map:       2.734

Testing with 1000 words of length 50:

                    original:       6.294
 filtered_list_comprehension:       4.313
                       regex:       4.884
               native_filter:       4.134
      native_filter_with_map:       4.171

Testing with 1000 words of length 100:

                    original:       11.638
 filtered_list_comprehension:       8.130
                       regex:       7.756
               native_filter:       7.858
      native_filter_with_map:       7.790

Testing with 1000 words of length 500:

                    original:       55.100
 filtered_list_comprehension:       38.052
                       regex:       28.049
               native_filter:       37.196
      native_filter_with_map:       37.209

From this I would conclude that if your "words" being tested can be up to 500 characters or so long, a regex works well. Otherwise, filtering with `str.isdigit` seems to be the best approach across a variety of lengths.
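For completeness, a driver along these lines can generate the test words and produce timings like those above (the alphabet and repeat count here are illustrative assumptions, not necessarily the exact ones used):

import random
import string
import timeit

def make_words(n, length):
    # Random alphanumeric "words"; the alphabet is an assumption
    alphabet = string.ascii_lowercase + string.digits
    return ["".join(random.choice(alphabet) for _ in range(length))
            for _ in range(n)]

for length in (10, 20, 30, 50, 100, 500):
    print("Testing with 1000 words of length %d:" % length)
    test_words = make_words(1000, length)
    for fn in (original, filtered_list_comprehension, regex,
               native_filter, native_filter_with_map):
        t = timeit.timeit(lambda: fn(test_words), number=100)
        print("%28s: %10.3f" % (fn.__name__, t))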

– Hamms

Your code actually didn't work for me. This seems equivalent, though; maybe it'll help.

words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']
map(lambda v: sum(v) / float(len(v)), map(lambda v: map(lambda u: u.isdigit(), v), words))
# [0.5, 0.6666666666666666, 0.5, 0.625, 0.0]
– aku
  • Although `map()` should be fast since it iterates in **C**, it's only a win if the code you pass it is also written in **C** -- if you pass it Python code (e.g. `lambda`), you don't get the anticipated speed. If you replace your `map(lambda u: u.isdigit(), v)` with `map(str.isdigit, v)` you should be able to measure the improvement when the test strings get up to around 100 characters in length. – cdlane Aug 15 '17 at 23:21
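For illustration, a sketch of the change cdlane suggests (Python 2 semantics, matching the answer above):

words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']
# str.isdigit runs in C, avoiding a Python-level lambda call per character
print(map(lambda v: sum(map(str.isdigit, v)) / float(len(v)), words))
# [0.5, 0.6666666666666666, 0.5, 0.625, 0.0]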

Try this:

Words = ['p2', 'p23', 'pp34', 'ppp01932', 'boss']

def get_digits(string):
    # Count the digit characters in the string
    c = 0
    for i in string:
        if i.isdigit():
            c += 1
    return c

for item in Words:
    print(round(float(get_digits(item)) / len(item), 2))

Note: this has been adapted from Benjamin Wohlwend's answer to this question.

– Stealing

Hint: you can speed up your code by replacing builtin lookups with local name lookups.

This is the fastest solution for me:

def count(len=len):  # bind the builtin len to a local name at definition time
    for word in words:
        len([c for c in word if c.isdigit()]) / len(word)

This is basically Hamms's `filtered_list_comprehension` / Peter's `isdigsub4` with the `len=len` optimization.

With this trick, the compiled bytecode uses `LOAD_FAST` instead of `LOAD_GLOBAL` to look up `len`. This gave me a 3.6% speedup. Not much, but better than nothing.
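You can confirm the difference with the `dis` module; a quick sketch (function names here are illustrative):

import dis

def with_global():
    return len('abc')

def with_local(len=len):
    return len('abc')

dis.dis(with_global)  # len is looked up with LOAD_GLOBAL
dis.dis(with_local)   # len is looked up with LOAD_FAST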

– Andrea Corbellini