I have strings of different lengths that need to be checked for substrings matching any of "tion", "ex", "ph", "ost", "ast", "ist", ignoring case and position (prefix, suffix, or middle of a word). The matching words should be returned in a new list, not just the matched substrings. With the code below I get a list of the matched substrings, but not the full words that contain them.

def latin_ish_words(text):
    import re
    pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
    matches=pattern.findall(text)
    return matches
latin_ish_words("This functions as expected")

The result is: ['tion', 'ex']

How can I return the whole words, rather than only the matched substrings, in a new list?

tnsoom
3 Answers

You can use one of the following patterns:

pattern=re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
pattern=re.compile(r"[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*")
pattern=re.compile(r"[^\W\d_]*?(?:tion|ex|ph|ost|ast|ist)[^\W\d_]*")

The first regex matches:

  • \w*? - zero or more but as few as possible word chars
  • (?:tion|ex|ph|ost|ast|ist) - one of the strings
  • \w* - zero or more but as many as possible word chars

The [a-zA-Z] part will match only ASCII letters, and [^\W\d_] will match any Unicode letters.
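To make the ASCII vs. Unicode difference concrete, here is a small sketch. The sample word and the variable names are my own, chosen for illustration, not taken from the question:

```python
import re

# The alternation shared by both variants (same substrings as in the question).
SUBSTRINGS = r"(?:tion|ex|ph|ost|ast|ist)"

# ASCII letters only vs. any Unicode letter (word chars minus digits and underscore).
ascii_letters = re.compile(rf"[a-zA-Z]*?{SUBSTRINGS}[a-zA-Z]*")
unicode_letters = re.compile(rf"[^\W\d_]*?{SUBSTRINGS}[^\W\d_]*")

text = "\u00e9volution"  # "évolution", illustrative word with a non-ASCII letter

ascii_result = ascii_letters.findall(text)
unicode_result = unicode_letters.findall(text)

print(ascii_result)    # ['volution']  (the leading "é" is not an ASCII letter)
print(unicode_result)  # ['évolution']
```

So if the input may contain accented words, the ASCII-only class silently truncates them.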

Note the non-capturing group: when the pattern contains a capturing group, re.findall returns the captured substrings instead of the whole matches.
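A quick way to see the effect of the group type is to run both variants on the sentence from the question (variable names here are just for the illustration):

```python
import re

text = "This functions as expected"

# Capturing group: findall returns what the group captured.
capturing = re.findall(r"\w*?(tion|ex|ph|ost|ast|ist)\w*", text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*", text)

print(capturing)      # ['tion', 'ex']
print(non_capturing)  # ['functions', 'expected']
```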

If you only want words made up of letters, and they must match as whole words, add word boundaries: r"\b[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*\b".
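As a rough illustration of what the word boundaries change, compare the letter-only pattern with and without \b. The sample text is made up so that some "words" contain an underscore or a digit:

```python
import re

text = "next_step 2exit exit"  # illustrative input, not from the question

without_boundaries = re.findall(
    r"[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*", text
)
with_boundaries = re.findall(
    r"\b[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*\b", text
)

# Without \b, letter runs inside "next_step" and "2exit" still match.
print(without_boundaries)  # ['next', 'exit', 'exit']

# With \b, only the standalone all-letter word survives.
print(with_boundaries)     # ['exit']
```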

See the Python demo:

import re

def latin_ish_words(text):
    # Leading word chars (lazy), one of the substrings, then the rest of the word
    pattern = re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
    return pattern.findall(text)

print(latin_ish_words("This functions as expected"))
# => ['functions', 'expected']
Wiktor Stribiżew

The question asks for matching "ignoring the case", but

pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
matches=pattern.findall(text)

does not do that. Consider the following example:

import re
pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
text = "SCREAMING TEXT"
print(pattern.findall(text))

output

[]

The result is empty even though the text contains EX. To fix this, add the re.IGNORECASE flag like so:

import re
pattern=re.compile(r"tion|ex|ph|ost|ast|ist", re.IGNORECASE)
text = "SCREAMING TEXT"
print(pattern.findall(text))

output

['EX']
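The flag also works when the pattern is widened with \w*? and \w* so that it returns whole words rather than fragments. A minimal sketch, assuming that whole-word pattern is what you want; the name PATTERN and the sample text are my own:

```python
import re

# Whole-word pattern plus case-insensitive matching.
PATTERN = re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*", re.IGNORECASE)

def latin_ish_words(text):
    """Return every word that contains one of the substrings, in any case."""
    return PATTERN.findall(text)

result = latin_ish_words("SCREAMING TEXT about Photos")
print(result)  # ['TEXT', 'Photos']
```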
Daweo

For a case-insensitive match with whitespace boundaries you could use:

(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)

The pattern matches:

  • (?i) Inline modifier for a case insensitive match (Or use re.I)
  • (?<!\S) Assert a whitespace boundary to the left
  • \w* Match optional word characters
  • (?: Non capture group
    • tion|ex|ph|[oia]st Match either tion, ex, ph, or one of ost, ist, ast using a character class
  • ) Close non capture group
  • \w* Match optional word characters
  • (?!\S) Assert a whitespace boundary to the right
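Whitespace boundaries are stricter than plain \w* matching: a word directly followed by punctuation is skipped. The comparison below is only a sketch with a made-up sentence to show that trade-off:

```python
import re

text = "Expected list, fast! post"  # illustrative input with punctuation

# Whitespace boundaries: the word must be delimited by whitespace or string ends.
whitespace_bounded = re.findall(
    r"(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)", text
)

# No boundaries: any run of word characters containing a substring matches.
unbounded = re.findall(r"(?i)\w*(?:tion|ex|ph|[oia]st)\w*", text)

print(whitespace_bounded)  # ['Expected', 'post']
print(unbounded)           # ['Expected', 'list', 'fast', 'post']
```

If words next to commas or exclamation marks should count, the unbounded or \b-based variants are probably the better fit.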

def latin_ish_words(text):
    import re
    pattern = r"(?i)(?<!\S)\w*(?:tion|ex|ph|[oia]st)\w*(?!\S)"
    return re.findall(pattern, text)

print(latin_ish_words("This functions as expected"))

Output

['functions', 'expected']
The fourth bird