5

I want to check whether a particular string is present in a sentence. I am using this simple code for the purpose:

subStr = 'joker'
Sent = 'Hello World I am Joker'

if subStr.lower() in Sent.lower():
    print('found')

This is an easy, straightforward approach, but it fails when the sentence appears as:

hello world I am Jo ker

hello world I am J oker

Because I am parsing the sentences from a PDF file, unnecessary spaces show up here and there.

A simple way to tackle this would be to remove all the spaces from the sentence and then look for a substring match. I would like to know other people's thoughts on this: should I stick with this approach or look for alternatives?

jpp
Olivia Brown
  • 4
    How would you differentiate between "to day" and "today" if your input has arbitrary spacing? – jpp Feb 06 '18 at 13:48

4 Answers

2

You can use a regular expression that allows optional whitespace between the letters:

import re
word_pattern = re.compile(r'j\s*o\s*k\s*e\s*r', re.I)
sent = 'Hello World I am Joker'
if word_pattern.search(sent):
    print('found')

I hope this works
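
The hard-coded pattern above can also be generated for any search term. A minimal sketch, assuming a hypothetical helper named `spaced_pattern` (not part of the answer); `re.escape` is used so that terms containing regex metacharacters are matched literally:

```python
import re

def spaced_pattern(term):
    """Compile a case-insensitive pattern allowing whitespace between the characters of term."""
    # Drop whitespace already present in the term, escape each character,
    # and allow optional whitespace between consecutive characters.
    chars = [re.escape(ch) for ch in term if not ch.isspace()]
    return re.compile(r'\s*'.join(chars), re.IGNORECASE)

pattern = spaced_pattern('joker')
for sent in ('Hello World I am Joker',
             'hello world I am Jo ker',
             'hello world I am J oker'):
    print(sent, '->', bool(pattern.search(sent)))
```

Like the original pattern, this has no word boundaries, so it will also match inside longer words.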

Tzomas
2

In my tests this was more efficient than `str.replace` for small strings and more expensive for large ones. It won't deal with ambiguous cases, e.g. 'to day' vs 'today'.

subStr in ''.join(Sent.split()).lower()  # True
jpp
  • A quick test suggests this is about half as fast as using `subStr in Sent.replace(' ', '').lower()` - a bit better if Sent is very short, a bit worse if it's very long. – Nathan Vērzemnieks Feb 07 '18 at 03:12
  • @NathanVērzemnieks, I will update my answer. My test was for small strings, as per OP. – jpp Feb 07 '18 at 09:10
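
The comparison discussed in these comments can be reproduced with `timeit`. A rough sketch: the sample sentence, the repeat count, and the function names `with_split` and `with_replace` are arbitrary choices for illustration, and the numbers will vary with the machine and the input length.

```python
import timeit

sent = 'Hello World I am Jo ker ' * 10   # sample input; change the length to taste
sub = 'joker'

def with_split(s=sent):
    # str.split() with no arguments splits on any run of whitespace
    return sub in ''.join(s.split()).lower()

def with_replace(s=sent):
    # str.replace(' ', '') removes plain spaces only, not tabs or newlines
    return sub in s.replace(' ', '').lower()

for fn in (with_split, with_replace):
    seconds = timeit.timeit(fn, number=10_000)
    print(f'{fn.__name__}: {seconds:.4f} s')
```

Note that the two are not strictly equivalent: only the `split` version also removes tabs and line breaks, which may matter for text extracted from a PDF.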
0

Try this. It may break unexpectedly on some inputs, but it might work for your use case:

In [1]: Sent = 'Hello World I am Joker'

In [3]: subStr = 'Joker'

In [4]: if subStr in Sent.replace(' ', ''):
   ...:     print("Do something")
   ...:     
Do something
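
One way the space-stripping test can break: once the spaces are gone, a match can straddle two unrelated words. A small illustration, assuming a hypothetical helper `contains_ignoring_spaces` and made-up sentences:

```python
def contains_ignoring_spaces(sub, sentence):
    """Case-insensitive substring test after deleting every space from sentence."""
    return sub.lower() in sentence.replace(' ', '').lower()

# Intended match: the word was split by a stray space.
print(contains_ignoring_spaces('Joker', 'Hello World I am Jo ker'))

# Unintended match: "joke" followed by the "r" of "really".
print(contains_ignoring_spaces('Joker', 'The joke really fell flat'))
```

Both calls report a match, so this check is best suited to search terms that are unlikely to arise by accident.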
Arpit Solanki
0

Your proposed approach - removing spaces - seems straightforward and efficient (two to ten times faster than the other suggestions, in some simple tests). If you need to minimize false positives, though, you might be better off with the regular expression approach. You could add word boundaries to avoid partial word matches, and examine the matching substring to see if any spaces could be real spaces, perhaps by matching against a canonical word list.

>>> import re
>>> dictionary = {'ever', 'green'}  # stand-in for a real word list
>>> sentence = 'Were the fields ever green? - they were never green.'
>>> target = 'evergreen'
>>> pattern = re.compile(r'\b' + r'\s*'.join(target) + r'\b')
>>> pattern.findall(sentence)  # only one match because of \b
['ever green']
>>> matching_words = pattern.findall(sentence)[0].split()
>>> all(word in dictionary for word in matching_words)
True
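
The same idea can be wrapped in a function that flags every match whose spaces could be genuine. This is only a sketch: `classify_matches` is a hypothetical name, and `WORDS` is a tiny stand-in for whatever canonical word list you have.

```python
import re

# Hypothetical stand-in for a real word list.
WORDS = {'ever', 'green', 'evergreen', 'were', 'the', 'fields'}

def classify_matches(target, sentence, words=WORDS):
    """Yield (matched_text, ambiguous) for each spaced-out occurrence of target."""
    pattern = re.compile(
        r'\b' + r'\s*'.join(map(re.escape, target)) + r'\b',
        re.IGNORECASE,
    )
    for match in pattern.finditer(sentence):
        pieces = match.group().lower().split()
        # If the match contains spaces and every piece is itself a word,
        # the spaces may be real, so the match is flagged as ambiguous.
        ambiguous = len(pieces) > 1 and all(p in words for p in pieces)
        yield match.group(), ambiguous

sentence = 'Were the fields ever green? An ever gre en hedge.'
print(list(classify_matches('evergreen', sentence)))
```

Here 'ever green' is flagged as ambiguous because both pieces are words, while 'ever gre en' is not, since 'gre' and 'en' are not in the list.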
Nathan Vērzemnieks