
I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).

I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?

This is the solution I currently have:

sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string #refind string and get surrounding += 4 words
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
  result_found.append(" ".join(map(str,found_text)))
    You don't need to use a regex to get the window. Just find the position of the match and go from there. Though I'm not sure why it'd be slow, maybe the `*`s, and I'm not sure why you're using those btw. – keyser Aug 08 '15 at 18:58
    How strict is your definition of a "word". e.g. is `'foo-bar-baz'` 1 word or 3? Given your regex, it's 3, but if it can be 1, then your problem might simplify a bit ... – mgilson Aug 08 '15 at 19:02
    definition of a word is very loose. How does that simplify things? – user1253952 Aug 08 '15 at 19:04
  • Why do you need to individually capture surrounding words? And what input causes it to hang? –  Aug 08 '15 at 19:17
  • See [Why my Python regular expression pattern run so slowly?](http://stackoverflow.com/questions/27448200/why-my-python-regular-expression-pattern-run-so-slowly/27448381#27448381) on the speed matter. – ivan_pozdeev Aug 08 '15 at 19:18
    I guess I could capture characters if that's any easier, I just need to have some notion of what surrounds the matches. – user1253952 Aug 08 '15 at 19:29
  • Lesser error: you do not `re.escape` the string, thus condemning yourself to the hazards of regex metacharacters in it. – ivan_pozdeev Aug 08 '15 at 19:31
  • Some problems: everything is optional, you could use cascading options and it would speed things up, but using `\w` `\W` pairs still drags it down. –  Aug 08 '15 at 19:48

4 Answers


I'm not sure if this is what you're looking for:

>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'

Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.
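str.partition() only splits on the first occurrence of the separator, though. To handle text with several matches, one could repeatedly partition the remaining tail; a minimal sketch (the loop and the helper name are my own, not part of the answer):

```python
def all_windows(text, sep):
    """Yield (before, sep, after) for every occurrence of sep in text.
    'before' runs from the start of the string and 'after' to its end,
    mirroring what a single str.partition() call returns."""
    rest = text
    offset = 0
    while True:
        before, found, after = rest.partition(sep)
        if not found:
            break
        yield text[:offset + len(before)], found, after
        offset += len(before) + len(found)
        rest = after
```

From each tuple, the window of interest can then be trimmed down to however many words are needed.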

    I agree with the sentiment! Regular expressions are clearly *not* the answer here. But can you make something like this work for when `text` contains multiple matches? – LondonRob Aug 10 '15 at 09:26

The main problem with your pattern is that it begins with optional elements, which causes many attempts at each position in the string until a match is found. The number of attempts grows with the size of the text and with the value of n (the number of words before and after). This is why only a few lines of text are enough to hang your code.

One approach is to start the pattern with the target word and to use lookarounds to capture the text (or the words) before and after it:

keyword (?= words after ) (?<= words before - keyword)

Starting the pattern with the searched word (a literal string) makes matching very fast, and the surrounding words are then quickly found from that position in the string. Unfortunately, the re module has some limitations and doesn't allow variable-length lookbehinds (unlike many other regex flavors).
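As a quick illustration of that limitation (my own, not from the answer):

```python
import re

# The stdlib re module rejects a variable-length lookbehind at compile time.
try:
    re.compile(r'(?<=\w+ )harpooneers')
except re.error as e:
    print(e)  # e.g. "look-behind requires fixed-width pattern"
```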

The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).

import regex

text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''

word = 'harpooneers'
n = 4

pattern = r'''
\m (?<target> %s ) \M # target word
(?<= # content before
    (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
    %s
)
(?=  # content after
    (?<after>  (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)

rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)

class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')


results = [Result(m) for m in rgx.finditer(text)]

print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)

Making a regex (well, anything, for that matter) with "as many repetitions as you will ever possibly need" is an extremely bad idea. That's because you

  • do an excessive amount of needless work every time
  • cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation

The bottom line for the solutions below: the 1st one is the most effective for large data; the 2nd one is the closest to your current approach, but scales much worse.


  1. strip your entities to exactly what you are interested in at each moment:

    • find the substring (e.g. with str.index; for whole words only, re.search with e.g. r'\b%s\b' % re.escape(word) is more suitable)
    • go N words back.
      • Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
      • E.g. run re.finditer over a reversed view of the substring in place (see the linked questions on slicing immutable strings by reference and on the best way to loop over a Python string backwards). This will only beat slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. This turns out not to work: re operates directly on the memory buffer, so it's impossible to reverse a string for it without copying the data.
      • There's no function in Python to find a character from a class, nor an "xsplit". So the fastest way appears to be `(i for i, c in enumerate(reversed(buffer(text, 0, substring_index))) if c.isspace())` (timeit gives ~100 ms on a P3 933 MHz for a full pass through a 100k string; note that `buffer` is Python 2-only).
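Setting the copy-avoidance concern aside, the find-then-step-outward idea above can be sketched in a few lines (the function name and the whitespace-based notion of a "word" are my own assumptions):

```python
import re

def surrounding_words(text, word, n=4):
    """For each whole-word occurrence of `word`, return it together with
    up to n words of context on each side, located from the match
    position rather than captured by the regex itself."""
    out = []
    for m in re.finditer(r'\b%s\b' % re.escape(word), text):
        before = text[:m.start()].split()[-n:]  # up to n words to the left
        after = text[m.end():].split()[:n]      # up to n words to the right
        out.append(" ".join(before + [m.group()] + after))
    return out
```

For example, surrounding_words("one two three target four five six", "target", 2) yields ["two three target four five"].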

Alternatively:

  1. Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
    The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
    As for the 1st issue, if you really only need "up to a known, fixed N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.

    • The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
    • Efficiently matching up to N words before the string is harder:
      • with (\w+\W+){,10} or equivalent, the matcher will find up to 10 words before discovering that your string doesn't follow them, then retry with 9, 8, etc. To ease the matcher's work somewhat, a \b before the pattern will make it only perform all of this at the beginning of each word
      • a lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it were allowed - a lookbehind is tried before every character - i.e. it would be even more of a CPU hog
      • As you can see, regexes aren't quite cut out for matching things backwards
    • To eliminate code duplication, either
      • use the aforementioned {,10}. This will not save the individual words but should be noticeably faster for large texts (see above on how the matching works here). We can always parse the retrieved chunk of text in more detail (with the regex in the next item) once we have it. Or
      • autogenerate the repetitive part
        • note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (w=(\w+\W+) here for brevity): (w(w...(ww?)?...)?)? (and all the groups need to be non-capturing).
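The autogeneration could be as simple as nesting the optional groups in a loop; a sketch (the function name and the choice of (?:\w+\W+) as the repeated unit are mine, for illustration):

```python
import re

def up_to(n, unit=r'(?:\w+\W+)'):
    """Build an unambiguous "0 to n repetitions" pattern by nesting
    optional non-capturing groups, e.g. (?:u(?:u(?:u)?)?)? for n = 3,
    instead of writing the ambiguous (?:u)? n times in a row."""
    pattern = ''
    for _ in range(n):
        pattern = '(?:%s%s)?' % (unit, pattern)
    return pattern
```

With this, re.fullmatch(up_to(3), 'foo bar ') succeeds, while a string of four word-plus-separator units no longer fits.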

I personally think that using text.partition() is the best option, as it eliminates messy regular expressions and automatically leaves the output in an easy-to-access tuple.