
What I need is:

text_file_1.txt:
apple
orange
ice
icecream

text_file_2.txt:
apple
pear
ice

When I use `set` intersection, the output is:

apple
ice

("equivalent of re.match")

but I want to get:

apple
ice
icecream

("equivalent of re.search")

Is there any way to do this? The files are large, so I can't just iterate over them and run a regex on every pair.
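A minimal sketch of the `set` behaviour above (word lists hard-coded for illustration):

```python
# Set intersection keeps only words present in both files verbatim.
words_1 = {"apple", "orange", "ice", "icecream"}
words_2 = {"apple", "pear", "ice"}

common = words_1 & words_2
print(sorted(common))  # 'icecream' is missed: it merely *contains* 'ice'
```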

Mephian
  • If you just want all words in B starting with a word in A: `{b for b in input_2 if any(b.startswith(a) for a in input_1)}`. This will be O(n^2). Otherwise, could you post the code which you would like to run but is too slow? Then we can at least understand what you are trying to do. – Katriel Jul 07 '11 at 15:51

2 Answers


You might want to check out the `difflib` module in the standard library.

  • Can you advise which function of `difflib` is the best one? – Mephian Jul 07 '11 at 16:07
  • @Mephian: until you define what you mean by "similar match", it's impossible to answer this question. The standard `ratio()` function, for example, will **not** return the list that you asked for (try the code in my answer to check for yourself). – mac Jul 07 '11 at 16:16
  • @mac did the legwork on that in his edited answer. You should probably give him the 'accepted answer' –  Jul 07 '11 at 16:18
  • @Paul - true sportsmanship! [Although I don't think he can unless you edit your answer first; I believe there is a time limit]. Unfortunately I ran out of votes for today... I will +1 your answer tomorrow though! :) – mac Jul 07 '11 at 16:22
  • @mac: by similar match I mean words with the same root, like "ice" - "icecream", or "icecream" - "strawberryicecream", and I am talking about millions of words in each file. – Mephian Jul 07 '11 at 16:24
  • @Mephian - If same root means "one is a substring of the other", that is easy and probably faster to solve bypassing difflib. But if you mean that `iced` and `icecream` need to be hits, then difflib is a better choice, but you will have to look at `get_opcodes()` and it will be **very** slow, as stuff like `ratio()` or `quick_ratio()` will report false positives like `cream` and `stream`... – mac Jul 07 '11 at 16:33
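A quick check of the false positives mentioned in the last comment, assuming the conventional 0.6 cutoff:

```python
from difflib import SequenceMatcher

# 'cream' and 'stream' share the block 'ream', so ratio() scores them as close...
false_positive = SequenceMatcher(None, "cream", "stream").ratio()
# ...while 'ice' vs 'icecream' falls below the cutoff despite sharing a root.
missed = SequenceMatcher(None, "ice", "icecream").ratio()

print(false_positive, missed)  # roughly 0.73 and 0.55 respectively
```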

If all you want is to extract from the files the words where one is a substring of the other (including words that are identical), you could do:

fone = set(['apple', 'orange', 'ice', 'icecream'])
ftwo = set(['apple', 'pear', 'ice'])
# converting to sets avoids checking the same combination twice

result = []
for wone in fone:
    for wtwo in ftwo:
        if wtwo in wone or wone in wtwo:  # substring match in either direction
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)

Alternatively, if you want a similarity measure based on the order of the letters in the strings, you could use, as Paul suggested in his answer, the `SequenceMatcher` class provided by difflib:

import difflib as dl

fone = set(['apple', 'orange', 'ice', 'icecream'])
ftwo = set(['apple', 'pear', 'ice'])

result = []
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        if s.ratio() > 0.6:  # 0.6 is the conventional threshold for "close matches"
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)

I did not time either of the two samples, but I would guess the second runs much slower, as for each pair you have to instantiate an object...

mac
  • It would be more efficient if you replaced the nested for-loop with `itertools.product`: `for wone, wtwo in itertools.product(fone, ftwo):` –  Jul 07 '11 at 16:03
  • @Paul - True, I thought of that myself, but I am under the impression (though I did not profile the code) that 95% of the time will be spent on the `str.find()`'s rather than in the looping, so I considered it a *premature optimisation* (=futile). I might be wrong though; the only way to know is to time them! :) – mac Jul 07 '11 at 16:12
  • Yeah, I don't usually fall into the premature optimization trap, but nested loops have always (_always_) been a pet peeve of mine, ever since I learned C++, so when I took up Python and found `itertools.product`, I started reaching for it instead of nested loops. –  Jul 07 '11 at 16:17
  • As to the problem of instantiating the `SequenceMatcher` object every time through the loop, I am curious whether this would be any more efficient: http://codepad.org/FFe7kM8L. I am setting up some scripts to test this right now. –  Jul 07 '11 at 16:23
  • If you're still interested, I have the test code and results at http://pastebin.com/G0fp25qu. Long story short, updating a single object is about 20 seconds faster after 1000000 iterations, which isn't too surprising. What is surprising (to me) is that using `itertools.product` isn't faster at all :-/ –  Jul 07 '11 at 17:20
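The single-object variant discussed in the last two comments can be sketched like this; per the difflib docs, `SequenceMatcher` caches information about its second sequence, so keeping seq2 fixed in the outer loop and swapping only seq1 avoids redoing that work (variable names are illustrative):

```python
import difflib

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}

matcher = difflib.SequenceMatcher(None)
result = set()
for wtwo in ftwo:
    # set_seq2 caches per-sequence data, so set it in the outer loop...
    matcher.set_seq2(wtwo)
    for wone in fone:
        # ...and only swap the uncached first sequence in the inner loop.
        matcher.set_seq1(wone)
        if matcher.ratio() > 0.6:
            result.update((wone, wtwo))

print(sorted(result))
```

Note that with the 0.6 cutoff this still reports only the exact matches from the sample data, as `ice`/`icecream` scores below the threshold.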