3

I have a list with string

fruits = ["apple", "orange", "grape"]

and a string that's missing whitespace

str = "ilikeapplesandoranges"

What's an efficient way of counting how many words from fruits exist in str?

For the example above, it would be 2, because we have ilike[apple]sandoranges and ilikeapplesand[orange]s

For a string like ilikebroccoli, it would be 0.

I know you can do str.count(word) but that's not efficient to do for each word in the list.

Algorithmatic
  • 1,824
  • 2
  • 24
  • 41

1 Answers1

1

It is hard to beat the C speed of str.count(word) actually.

I would use sum(your_string.count(e) for e in your_list) and be done with it.

Here is a quick benchmark of the answers here:

import time 

def li_count(s, li):
    return sum(s.count(e) for e in li)

import re 

def li_regex(s,li):
    return len(re.findall('|'.join(f'(?=({t}))' for t in li), s))

def li_split(s, li):
    return sum(len(s.split(e))-1 for e in li)

def cmpthese(funcs, args=(), cnt=10, rate=True, micro=True, deepcopy=True):
    from copy import deepcopy 
    """Generate a Perl style function benchmark"""                   
    def pprint_table(table):
        """Perl style table output"""
        def format_field(field, fmt='{:,.0f}'):
            if type(field) is str: return field
            if type(field) is tuple: return field[1].format(field[0])
            return fmt.format(field)     
        
        def get_max_col_w(table, index):
            return max([len(format_field(row[index])) for row in table])         
        
        col_paddings=[get_max_col_w(table, i) for i in range(len(table[0]))]
        for i,row in enumerate(table):
            # left col
            row_tab=[row[0].ljust(col_paddings[0])]
            # rest of the cols
            row_tab+=[format_field(row[j]).rjust(col_paddings[j]) for j in range(1,len(row))]
            print(' '.join(row_tab))                
            
    results={}
    for i in range(cnt):
        for f in funcs:
            if args:
                local_args=deepcopy(args)
                start=time.perf_counter_ns()
                f(*local_args)
                stop=time.perf_counter_ns()
            results.setdefault(f.__name__, []).append(stop-start)
    results={k:float(sum(v))/len(v) for k,v in results.items()}     
    fastest=sorted(results,key=results.get, reverse=True)
    table=[['']]
    if rate: table[0].append('rate/sec')
    if micro: table[0].append('\u03bcsec/pass')
    table[0].extend(fastest)
    for e in fastest:
        tmp=[e]
        if rate:
            tmp.append('{:,}'.format(int(round(float(cnt)*1000000.0/results[e]))))
            
        if micro:
            tmp.append('{:,.1f}'.format(results[e]/float(cnt)))
            
        for x in fastest:
            if x==e: tmp.append('--')
            else: tmp.append('{:.1%}'.format((results[x]-results[e])/results[e]))
        table.append(tmp) 
        
    pprint_table(table)                    
    
if __name__=='__main__':
    import sys
    print(sys.version)
    
    fruits = ["apple", "orange", "grape"]
    s = "ilikeapplesandorangesandapples"
    
    cases=(
        ('small list, small string', 1, 1),
        ('large list, small string', 2000, 1),
        ('small list, large string', 1, 2000),
        ('med list, med string', 500, 500)
    )
    for txt, x, y in cases:
        print(f'\n{txt}:')
        args=(s*y,fruits*x)
        cmpthese([li_count,li_regex, li_split],args)  

And that results are:

3.9.1 (default, Feb  3 2021, 07:38:02) 
[Clang 12.0.0 (clang-1200.0.32.29)]

small list, small string:
         rate/sec μsec/pass li_regex li_split li_count
li_regex      516   1,939.0       --   -92.4%   -93.4%
li_split    6,810     146.8  1220.4%       --   -12.5%
li_count    7,785     128.5  1409.4%    14.3%       --

large list, small string:
         rate/sec    μsec/pass li_regex li_split li_count
li_regex        0 39,162,529.8       --   -99.7%   -99.8%
li_split        9    106,144.2 36795.6%       --   -21.8%
li_count       12     83,023.3 47070.6%    27.8%       --

small list, large string:
         rate/sec μsec/pass li_regex li_split li_count
li_regex        2 453,577.1       --   -93.9%   -98.0%
li_split       36  27,754.5  1534.2%       --   -67.6%
li_count      111   8,985.6  4947.8%   208.9%       --

med list, med string:
         rate/sec       μsec/pass  li_regex li_split li_count
li_regex        0 1,166,950,320.1        --   -99.7%   -99.9%
li_split        0     3,418,900.8  34032.3%       --   -67.4%
li_count        1     1,114,163.6 104637.8%   206.9%       --
    

In all cases, the str.count method is the fastest -- usually by orders of magnitude.

dawg
  • 98,345
  • 23
  • 131
  • 206