I am using SequenceMatcher to find a set of words within a group of texts. I need to record when a word is not found, but only once per text. If I use an if statement, I get a result every time the comparison against another word fails.

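from difflib import SequenceMatcher

names = ['JOHN', 'LARRY', 'PETER', 'MARY']
files = []  # paths or links to the texts

for file in files:
    for name in names:
        if SequenceMatcher(None, name, file).ratio() > .9:
            pass  # do something
        else:
            # runs for every failed comparison, which is the problem
            print(name + ' not found')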

I have also tried re.match and re.find and I encounter the same problem. The code above is a simple version of what I am doing. I'm new to Python too. Thank you very much!

Connie
  • Can you clarify your question a bit? What should the output be if a word is found more than once? And if only once? And if it is not found at all? – mac Nov 21 '11 at 23:30
  • Yes. If a name is found, the output is some information about that person that comes right after the name. Every person is mentioned only once in a text, but not every person is in every text. If a person is not in a given text, I want to keep a record of that. This matters because I am creating a `csv` file in which each column is a name. Does this help? Thanks! – Connie Nov 22 '11 at 00:07

2 Answers

The simple way would be to keep track of the names already reported as not found and skip the message if it has been printed before:

seen = set()  # names already reported as not found
for file in files:
    for name in names:
        if SequenceMatcher(None, name, file).ratio() > .9:
            pass  # do something
        elif name not in seen:
            seen.add(name)
            print(name + ' not found')
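
If the goal is one record per text for each missing name, a variation along these lines might work. This is only a sketch: it assumes each text has already been read into a string, and it compares each name against the individual words of the text rather than against the whole string. The helper `missing_names`, the labels `a.txt`/`b.txt`, and the sample sentences are made up for illustration.

```python
from difflib import SequenceMatcher

def missing_names(text, names, threshold=0.9):
    """Return the names that have no close match among the words of `text`."""
    words = text.split()
    missing = []
    for name in names:
        # any() stops at the first word that is similar enough to the name
        found = any(
            SequenceMatcher(None, name, word).ratio() > threshold
            for word in words
        )
        if not found:
            # recorded once per text, not once per failed comparison
            missing.append(name)
    return missing

names = ['JOHN', 'LARRY', 'PETER', 'MARY']
# hypothetical sample texts, keyed by a label for each one
texts = {
    'a.txt': 'JOHN met MARY yesterday',
    'b.txt': 'PETER and LARRY left early',
}
for label, text in texts.items():
    for name in missing_names(text, names):
        print(name + ' not found in ' + label)
```

Because the "not found" decision is made only after every word has been checked, each name is reported at most once per text.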
Dave
  • This worked! Thank you. Though I placed `seen=[]` between the first `for` and the second `for` so that it resets for each file. – Connie Nov 22 '11 at 00:25

If I interpret your comment to the question correctly (but I am not 100% sure!), this might illustrate the general mechanism you can follow:

>>> text = 'If JOHN would be married to PETER, then MARY would probably be unhappy'
>>> names = ['JOHN', 'LARRY', 'PETER', 'MARY']
>>> [text.find(name) for name in names]
[3, -1, 28, 40]  # this list is always as long as the names list

What I mean by "mechanism you can follow" is that SequenceMatcher (which I replaced here with the built-in string method find) should not just act as a True/False test: it should directly produce the information you want to store, including a placeholder such as -1 for names that are missing.
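
To connect this to the csv file with one column per name, here is a rough sketch of the same idea. It assumes the texts are already in memory as strings and uses exact matching with `find`, not SequenceMatcher. The helper `row_for`, the second sample text, and the `'not found'` marker are illustrative choices, not part of the original question.

```python
import csv
import io

names = ['JOHN', 'LARRY', 'PETER', 'MARY']
# hypothetical sample texts
texts = [
    'If JOHN would be married to PETER, then MARY would probably be unhappy',
    'LARRY went home',
]

def row_for(text, names):
    """Map every name to its position in `text`, or to 'not found'."""
    row = {}
    for name in names:
        position = text.find(name)  # -1 means the name is absent
        row[name] = position if position != -1 else 'not found'
    return row

buffer = io.StringIO()  # stands in for a real output file
writer = csv.DictWriter(buffer, fieldnames=names)
writer.writeheader()
for text in texts:
    # every row has exactly one cell per name, found or not
    writer.writerow(row_for(text, names))
print(buffer.getvalue())
```

The comparison is against -1 rather than a truth test because a name at the very start of a text has position 0, which would otherwise be mistaken for "missing".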

HTH!

mac