I want to check whether any word from a dictionary file appears in a second text file. I have a problem with the following code:

print 'Searching for known strings...\n'
with open('something.txt') as f:
    haystack = f.read()
with open('d:\\Users\\something\\Desktop\\something\\dictionary\\entirelist.txt') as f:
    for needle in (line.strip() for line in f):
        if needle in haystack:
            print line

The `with open` statements are not mine; I took them from the question "Search for strings listed in one file from another text file?". I want to print the line, so I wrote `line` instead of `needle`. The problem: Python says `line` is not defined.

My final objective is to see whether any word from the dictionary appears in "something.txt" and, if so, to print the line where the word was found.

martineau
Maxim
  • Can you give us an example (stripped down to, say, 3 lines) of what `something.txt` and `entirelist.txt` look like, and what output you want? – abarnert Oct 30 '14 at 00:56

2 Answers

It looks like you've used a generator expression: `(line.strip() for line in f)`. Its loop variable `line` is local to the expression, so you can't access it from outside the expression's scope, i.e. outside the parentheses.

Try something like:

for line in f:
    if line.strip() in haystack:
        print line
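
To see the scoping rule in isolation, here is a minimal sketch. The function name `loop_variable_leaks` and the sample list are made up for illustration; the sketch assumes no global named `line` exists.

```python
def loop_variable_leaks():
    """Report whether a generator expression's loop variable is visible outside it."""
    lines = ["alpha\n", "beta\n"]  # hypothetical stand-in for the open file
    for needle in (line.strip() for line in lines):
        pass  # only `needle` is bound in this scope
    try:
        line  # the name the question tried to print
    except NameError:
        return False
    return True


print(loop_variable_leaks())
```

The lookup of `line` raises `NameError`, so the function reports that the variable does not leak, which matches the "line is not defined" message from the question.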
fileoffset
  • Since `line.strip()` is just going to be a string (a line from the dictionary, with the newline removed), your `for needle in line.strip():` is just going to be each character in that line. So that can't possibly be right. – abarnert Oct 30 '14 at 01:11

The specific exception you asked about (a `NameError`) occurs because `line` doesn't exist outside the generator expression. If you want to access it, you need to keep it in the same scope as the `print` statement, like this:

for line in f:
    needle = line.strip()
    if needle in haystack:
        print line

But this isn't going to be particularly useful. It's just going to be the word from needle plus the newline at the end. If you want to print out the line (or lines?) from haystack that include needle, you have to search for that line, not just ask whether needle appears anywhere in the whole haystack.

To literally do what you're asking for, you're going to need to loop over the lines of haystack and check each one for needle. Like this:

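# Read the text to search as a list of lines.
with open('something.txt') as f:
    haystacks = list(f)

with open('d:\\Users\\something\\Desktop\\something\\dictionary\\entirelist.txt') as f:
    for line in f:
        needle = line.strip()
        # Check every line of the text for this dictionary word.
        for haystack in haystacks:
            if needle in haystack:
                # Strip the line's own newline; print adds one.
                print haystack.rstrip('\n')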
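
One pitfall with this kind of loop: a blank line in the dictionary file yields an empty needle, and an empty string is `in` every string, so every line would match. Below is a minimal sketch that skips blank needles and collects the matches instead of printing them directly. The function name `find_matching_lines` and the sample data are hypothetical, used only for illustration.

```python
def find_matching_lines(needles, haystack_lines):
    """Return (needle, line) pairs for every line that contains a needle."""
    matches = []
    for raw in needles:
        needle = raw.strip()
        if not needle:
            # An empty string is "in" every string, so skip blank entries.
            continue
        for line in haystack_lines:
            if needle in line:
                # Drop the trailing newline so printing does not double-space.
                matches.append((needle, line.rstrip('\n')))
    return matches


# Hypothetical sample data standing in for the two files.
dictionary = ['Falco\n', '\n', 'album\n']
text = ['So says Falco.\n', 'Nothing here.\n', "That's Falco's third album\n"]

for needle, line in find_matching_lines(dictionary, text):
    print('%s: %s' % (needle, line))
```

With real files, the open file objects can be passed in place of the two lists, since both are simply iterated line by line (the haystack should be read into a list first because it is scanned once per needle).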

However, there's a neat trick you may want to consider: If you can write a regular expression that matches any complete line that includes needle, then you just need to print out all the matches. Like this:

import re

with open('something.txt') as f:
    haystack = f.read()
with open('d:\\Users\\something\\Desktop\\something\\dictionary\\entirelist.txt') as f:
    for line in f:
        needle = line.strip()
        # Match any whole line that contains the (escaped) needle.
        pattern = '^.*{}.*$'.format(re.escape(needle))
        # re.MULTILINE makes ^ and $ match at each line boundary.
        for match in re.finditer(pattern, haystack, re.MULTILINE):
            print match.group(0)

Here's an example of how the regular expression works:

^.*Falco.*$

Of course if you want to search case-insensitively, or only search for complete words, etc., you'll need to make some minor changes; see the Regular Expression HOWTO, or a third-party tutorial, for more.
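
As a rough sketch of those two changes together, the version below adds `re.IGNORECASE` and wraps the needle in `\b` word boundaries. The helper name `lines_with_word` and the sample text are hypothetical, and it assumes the regex definition of a word is close enough to the one you want.

```python
import re


def lines_with_word(word, haystack):
    """Return each complete line of haystack containing word as a whole word."""
    # \b marks a word boundary; re.escape protects any special characters.
    pattern = r'^.*\b{}\b.*$'.format(re.escape(word))
    flags = re.MULTILINE | re.IGNORECASE
    return [m.group(0) for m in re.finditer(pattern, haystack, flags)]


# Hypothetical sample text standing in for something.txt.
haystack = "So says FALCO.\nThe Falcon and the Snowman\nThat's Falco's third album\n"
print(lines_with_word('falco', haystack))
```

Here the first and third lines are returned and the `Falcon` line is not. Note that `\b` may not behave as hoped for needles that begin or end with punctuation, since a boundary needs a word character on one side.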

abarnert
  • Your second one works well; the third prints something like `<_sre.SRE_Match object at 0x022854B8>`. I'm also interested in searching case-insensitively and for complete words, so I'll look at your link, try on my own, and see what happens :) Thanks for helping and for offering alternatives :) – Maxim Oct 30 '14 at 22:01
  • @Maxim: Right, sorry, `finditer` returns `MatchObject`s, not just the matched strings. Which is a lot more useful, but if you're trying to see what's going on… well, I've edited the answer. To make regular expressions case-insensitive you can just add another flag (`re.MULTILINE | re.IGNORECASE`). To match only complete words, if you're lucky and `\b` has the same definition of word that you want, that's dead easy; otherwise it's a little more involved. Anyway, definitely play with things using Debuggex or another regex tool, it's a lot easier than the usual edit-debug cycle with source code. – abarnert Oct 30 '14 at 22:11
  • Thank you again! How do you match complete words with the same definition? I would also like to check whether a line has already been printed and, if so, not print it again. Is that possible? – Maxim Oct 30 '14 at 23:57
  • I added: `needle = ' ' + needle`. It seems to work on 'semi-complete words'. (Otherwise, if I wrote `needle = ' ' + needle + ' '`, it would not count the first word (not a problem) or the last word (that's a problem).) – Maxim Oct 31 '14 at 14:09
  • @Maxim: Read up on what `\b` does. Assuming the regex definition of a word is close enough, `\bFalco\b` instead of just `Falco` will match the `Falco` in `So says Falco.` or `That's Falco's third album` but not in `The Falcon and the Snowman`. – abarnert Oct 31 '14 at 19:32
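
On the earlier question about not printing a line twice, one possible approach is to remember the lines already reported in a set. This is only a sketch under assumptions: the name `unique_matching_lines` and the sample text are hypothetical, the words are assumed to be stripped and non-empty, and two different lines with identical text are treated as the same line.

```python
import re


def unique_matching_lines(words, haystack):
    """Return lines containing any whole word from words, each line only once."""
    seen = set()   # lines already reported
    result = []    # keeps first-seen order
    for word in words:
        pattern = r'^.*\b{}\b.*$'.format(re.escape(word))
        for m in re.finditer(pattern, haystack, re.MULTILINE):
            line = m.group(0)
            if line not in seen:
                seen.add(line)
                result.append(line)
    return result


# Hypothetical sample text built from the sentences in the comment above.
haystack = "So says Falco.\nThe Falcon and the Snowman\nThat's Falco's third album\n"
for line in unique_matching_lines(['Falco', 'album'], haystack):
    print(line)
```

The third line contains both `Falco` and `album` but is reported only once, and the `Falcon` line is skipped because of the `\b` boundaries.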