Counting matches from a vocabulary file in a window surrounding a keyword

Question

For my research I am trying to count, from a corpus, the number of times a series of compound terms (e.g. Safety Hazard), stored in a file with one phrase per line, co-occur within a 16-word window of a target keyword (e.g. Facility). I am not a programmer, so I have been trying to break the task into two parts: first, extract from the corpus every match on my target keyword together with the 8 words before and after it; then match my 'vocabulary file' against that extract.

I am on part 1 and have tried the code below, but I just get the <_sre.SRE_Match object at 0x028FFE78> message, and I am struggling to use repr. Ultimately I want an export file that lists my vocabulary terms, each followed by a count of how often it was found in that window around my target word. The re.search logic is based on what I found on this message board, which is why I tried it. Any suggestions, or other ways to do this, are appreciated.

input=open("Corpus.txt", "r")
matches=[]
lines=input.readlines()
for line in lines:
  m=re.search(r'(\S+\s+){0,8}facility(\s+\S+){0,8}',line)
  if m:
    matches.append(m)
    for m in matches:
      output.write(str(m))
      output.close()

Any help appreciated, Paul

This kind of looks like Python, minus indentation... What language is it? Mind adding that info to your tags? You may also want to clarify whether the 16-word window means "{8 words} keyword {8 more words}" or whether this is a flexible window. – tink, Jun 03 '13 at 23:11
Thanks for the reply, tink. Sorry, this is my first post here. Yes, it is Python; I have added Python as a tag. The window is absolute in that it does not matter if words are repeated: I just need to grab the 8 words before and the 8 words after the keyword. – Paul, Jun 04 '13 at 06:57

lenz · Accepted Answer · 2013-06-04T19:55:38.103


Is your corpus already tokenized? You should really make sure it is.
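
As a rough illustration of what "tokenized" could mean here, the sketch below lowercases the text and splits it into word tokens, so that 'Safety', 'safety' and 'safety.' all become the same token. The function name tokenize and the token pattern are assumptions made for this example, not part of any corpus tool; it does not handle inflected forms such as 'safeties', which would need stemming or lemmatization.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    # Letters and digits form a token; an internal apostrophe ("don't") is kept.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

print(tokenize("Safety first: the Facility's safety-hazard report."))
# ['safety', 'first', 'the', "facility's", 'safety', 'hazard', 'report']
```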

Anyway, I think you are interested in the groups of the match object:

output.write(''.join(m.groups()) + '\n')

You will then find that each group captures only the last word of its window, because a repeated capturing group keeps just its final repetition. You need an extra pair of parentheses around the whole repetition:

m = re.search(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)

The (?:...) is a non-capturing group: it defines the scope of {0,8}, but it doesn't give you an extra group in the result.
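
The difference between the two patterns can be seen on a short made-up line. This is only a sketch; the sample sentence and the variable names last_only and full are invented for the illustration.

```python
import re

line = "one two three facility four five"

# A repeated capturing group keeps only its last repetition.
last_only = re.search(r'(\S+\s+){0,8}facility(\s+\S+){0,8}', line)
print(last_only.groups())   # ('three ', ' five')

# An outer group around the non-capturing repetition captures the whole window.
full = re.search(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)
print(full.groups())        # ('one two three ', ' four five')

# group(0) is the entire matched text, keyword included.
print(full.group(0))        # one two three facility four five
```

Note that ''.join(m.groups()) leaves out the keyword itself; m.group(0) gives the full window if the keyword should appear in the output.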

Have a look at Python's official Regular Expression HOWTO, or search the web for a regex tutorial. In any case, you may be better off looking for an off-the-shelf corpus tool instead of re-inventing the wheel.

EDIT:
In order to match multiple occurrences of the keyword in one line, use re.findall() (returns a list) or re.finditer() (returns an iterator):

context = re.findall(r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})', line)

context will be a list of pairs, i.e. the left and the right window for every occurrence of the keyword. Note, however, that it will still not work if two occurrences of the keyword have fewer than 8 words between them, e.g.

foo bar facility bla foo bar baz facility foo bar

will generate one match only, with the other occurrence of "facility" swallowed by that match's window. (In this short example the greedy left window consumes the first "facility", so the match is anchored on the second one; when at least 8 words precede the first occurrence, the match is anchored on the first and the second lands in its right window.) The swallowed occurrence does not generate a match of its own, since re.findall() doesn't do overlapping matches: it resumes searching only after the end of the previous match. This also means that, if there are between 8 and 15 words in between, the second "facility"'s left window will be cut short by what the first match already consumed.
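
One way around the overlap limitation is to drop the regex for the windowing step and work on a list of tokens instead, slicing a window around every index where the keyword occurs. The sketch below is an alternative to the findall approach, not the method described above. The function name count_phrases_near and its parameters are hypothetical, and it assumes the text is lowercased and split on whitespace. A phrase that falls inside the windows of two keyword occurrences is counted once for each of them.

```python
import re

def count_phrases_near(tokens, keyword, phrases, width=8):
    """Count each phrase within `width` tokens either side of every `keyword`."""
    counts = {phrase: 0 for phrase in phrases}
    split_phrases = {phrase: phrase.lower().split() for phrase in phrases}
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        # Window of up to `width` tokens before and after the keyword.
        window = tokens[max(0, i - width): i + width + 1]
        for phrase, words in split_phrases.items():
            n = len(words)
            if n == 0:
                continue  # skip blank vocabulary lines
            for j in range(len(window) - n + 1):
                if window[j:j + n] == words:
                    counts[phrase] += 1
    return counts

text = "foo bar facility bla foo bar baz facility foo bar"
tokens = text.lower().split()
pattern = r'((?:\S+\s+){0,8})facility((?:\s+\S+){0,8})'

print(len(re.findall(pattern, text)))  # 1: the regex reports a single match
print(count_phrases_near(tokens, "facility", ["foo bar", "bar baz"]))
# {'foo bar': 6, 'bar baz': 2}: both occurrences of the keyword get a window
```

Reading the phrases from the vocabulary file and writing each phrase with its count to an output file would complete the task described in the question.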

edited Jun 04 '13 at 19:55

answered Jun 04 '13 at 07:26


Thanks Lenz. I have been using an off-the-shelf tool, LMOSS (http://www.indiana.edu/~clcl/LMOSS/), which lets you input a corpus, choose a search keyword, and paste in the vocabulary terms whose occurrences you want to count within a given word window of the search term. It has been very useful. Unfortunately it only handles single words, and I now need to test compound words, hence my foray into programming. Do you know of any others? I have been reading up on regex and made the changes you suggested to re.search and the groups. I now get text output, although with some duplication that I'll need to investigate. – Paul Jun 04 '13 at 08:47
I'm not into co-occurrences too much, but I think you should google for a KWIC (keyword in context) tool, and look for one that allows you to search for a multi-word term. But, just reminding you again: if you think that the variants 'safety', 'Safety', 'safeties', 'safety.' etc. all belong to the same "word", you need to do some preprocessing on your corpus (a nice corpus tool might do this for you). – lenz Jun 04 '13 at 09:17
Thanks Lenz. The Python code above seems so close to what I need now, but it does not seem to be finding all occurrences of the keyword on a line. For example, if there are fewer than 8 words before or after it, it does not capture anything. Any ideas? – Paul Jun 04 '13 at 11:26
If you expect more than one occurrence of the keyword in `line`, you can do the matching with `re.findall()`, or `re.finditer()`. – lenz Jun 04 '13 at 11:31
@lenz: If you edit the original answer to contain the findall that solved the problem, you'll get an upvote. :) – tink Jun 04 '13 at 17:41
