
I'm getting false positives in Python for the following example. I'm trying to find out whether a keyword exists in a string. The strings consist of words joined by (usually) an underscore or hyphen, so I only want a positive result when the keyword stands on its own rather than inside a longer word. It counts as a match if it is surrounded by a hyphen, an underscore, or anything else that is not a letter (typically an underscore or hyphen). The match should also be case-insensitive.

test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']

The expected output is this list of True/False values:

[True, True, True, True, True, True, False, False, False, False, True]

Implementation:

key_words = ['uat','dr','test','qa','dev']
for name in test_list:
     if any(x in name.lower() for x in key_words):
         print('True')
     else:
         print('False')

Results:

True
True
True
True
True
True
True
True
True
True
True

Is there a better way of doing this in Python?

If not, how would I do this using regex in Python?

Please keep in mind this is being looped over a large data set where performance does matter.

MAXGEN

5 Answers


Given:

>>> import re
>>> test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']
>>> key_words = ['uat','dr','test','qa','dev']

You can use re.split and any:

>>> [any(word.lower() in key_words for word in re.split(r'[^a-zA-Z]', s))
...     for s in test_list]
[True, True, True, True, True, True, False, False, False, False, True]

Which is the same as your target:

>>> tgt=[True, True, True, True, True, True, False, False, False, False, True]
>>> [any(word.lower() in key_words for word in re.split(r'[^a-zA-Z]', s))
...     for s in test_list]==tgt
True
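Since the question mentions a large data set, here is a minimal sketch of the same split-based idea with the splitter compiled once and the keywords held in a set, so each part is a constant-time lookup. The names `KEY_WORDS`, `SPLIT_NON_LETTERS` and `contains_keyword` are made up for illustration, and whether this is actually faster on real data would need measuring.

```python
import re

# Hypothetical names, chosen for this sketch only.
KEY_WORDS = frozenset(['uat', 'dr', 'test', 'qa', 'dev'])
SPLIT_NON_LETTERS = re.compile(r'[^a-z]+')  # compiled once, reused for every string


def contains_keyword(name):
    """Return True if any letter-only part of name is a keyword (case-insensitive)."""
    parts = SPLIT_NON_LETTERS.split(name.lower())
    # isdisjoint accepts any iterable, so no intermediate set is needed.
    return not KEY_WORDS.isdisjoint(parts)


test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

results = [contains_keyword(name) for name in test_list]
print(results)
```

On the question's `test_list` this should give the same list as the target above.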
dawg

Use a regex with a negative lookbehind and a negative lookahead, so the keyword may not have a letter on either side.

>>> import re
>>> test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']
>>> key_words = ['uat','dr','test','qa','dev']
>>> [True if re.search(r'(?i)(?<![a-z])(?:' + '|'.join(key_words) + ')(?![a-z])', i) else False for i in test_list]
[True, True, True, True, True, True, False, False, False, False, True]
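Because the question says this runs over a large data set, one possible variation is to build and compile the pattern once, outside the loop, and to escape the keywords in case they ever contain regex metacharacters. This is only a sketch of that idea; `KEYWORD_RE` and `has_keyword` are illustrative names, not part of the original answer.

```python
import re

key_words = ['uat', 'dr', 'test', 'qa', 'dev']

# Compile once. re.escape guards against keywords containing regex metacharacters,
# and re.IGNORECASE makes both the keywords and the [a-z] classes case-insensitive.
KEYWORD_RE = re.compile(
    r'(?<![a-z])(?:' + '|'.join(map(re.escape, key_words)) + r')(?![a-z])',
    re.IGNORECASE,
)


def has_keyword(name):
    """Return True if a keyword occurs in name with no letter directly before or after it."""
    return KEYWORD_RE.search(name) is not None


test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

results = [has_keyword(name) for name in test_list]
print(results)
```

Unlike `\b`, the `[a-z]` lookarounds treat digits and underscores as separators, which matches the "anything that is not a letter" rule in the question.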
Avinash Raj

Another method is to use \b to detect word boundaries. Unfortunately, _ is considered a word character, so we need to detect \b or _.

Not as terse or efficient as Avinash's solution, but possibly more readable.

import re

test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr',
             'server-dr-NA', 'server-DR', 'dress_prod', 'testosterone',
             'uatae', 'devacurl', 'dev_server']

key_words = ['uat','dr','test','qa','dev']

for name in test_list:
    for kw in key_words:
        regex = r'(\b|_)'+kw+r'(\b|_)'
        if re.search(regex, name, re.IGNORECASE):
            print('True')
            break  # exit "for kw" loop
    else:  # only executed if "for kw" loop exits via exhaustion, not via break
        print('False')
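To see why the pattern needs the `_` alternative, this short sketch compares plain `\b` with `(\b|_)` on a few strings. The variable names are illustrative only. The last case is a caveat rather than part of the original answer: digits are word characters too, so a keyword followed by a digit is not matched, even though the question treats any non-letter as a separator.

```python
import re

# "_" is a word character, so there is no \b between "_" and "d".
underscore_plain = bool(re.search(r'\bdev\b', 'server_dev', re.IGNORECASE))

# "-" is not a word character, so \b matches on both sides of "dev".
hyphen_plain = bool(re.search(r'\bdev\b', 'server-dev', re.IGNORECASE))

# Allowing "_" as an alternative to \b covers the underscore case.
underscore_alt = bool(re.search(r'(\b|_)dev(\b|_)', 'server_dev', re.IGNORECASE))

# Digits are word characters as well, so "dev2" is not matched by this pattern.
digit_alt = bool(re.search(r'(\b|_)dev(\b|_)', 'dev2', re.IGNORECASE))

print(underscore_plain, hyphen_plain, underscore_alt, digit_alt)
# False True True False
```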
Tom Zych

I think this pattern would be easy to understand and modify:

import re

# key_words and test_list are the lists defined in the question.
pattern = r'.*(^|[^a-z])({names})([^a-z]|$).*'.format(names='|'.join(key_words))

# .*(^|[^a-z])(uat|dr|test|qa|dev)([^a-z]|$).*

for name in test_list:
    print(bool(re.search(pattern, name, re.IGNORECASE)))
Mikhail Gerasimov

import re

key_words = ['uat','dr','test','qa','dev']
test_list = ['server_test', 'server_dev', 'server_uat', 'server_dr', 'server-dr-NA', 
             'server-DR', 'dress_prod', 'testosterone','uatae','devacurl', 'dev_server']



def check(word):
    parts = re.split('[^a-z]', word.lower())
    return any(part in key_words for part in parts)

print([check(item) for item in test_list])
TessellatingHeckler