
What I need is:

text_file_1.txt:
apple
orange
ice
icecream

text_file_2.txt:
apple
pear
ice

When I use `set` intersection, the output is:

apple
ice

("equivalent of re.match")

but I want to get:

apple
ice
icecream

("equivalent of re.search")

Is there any way to do this? The files are large, so I can't just iterate over them and run a regex on every pair.
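A minimal sketch of the `set` behaviour above (word lists hard-coded for illustration):

```python
# Set intersection keeps only words present in both files verbatim.
words_1 = {"apple", "orange", "ice", "icecream"}
words_2 = {"apple", "pear", "ice"}

common = words_1 & words_2
print(sorted(common))  # 'icecream' is missed: it merely *contains* 'ice'
```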

Mephian
  • If you just want all words in B starting with a word in A: `{b for b in input_2 if any(b.startswith(a) for a in input_1)}`. This will be O(n^2). Otherwise, could you post the code which you would like to run but is too slow? Then we can at least understand what you are trying to do. – Katriel Jul 07 '11 at 15:51

2 Answers


You might want to check out the `difflib` module in the standard library.

  • Can you advise which function of `difflib` is the best one? – Mephian Jul 07 '11 at 16:07
  • @Mephian: until you define what you mean by "similar match", it's impossible to answer this question. The standard `ratio()` function, for example, will **not** return the list that you asked for (try the code in my answer to check for yourself). – mac Jul 07 '11 at 16:16
  • @mac did the legwork on that in his edited answer. You should probably give him the 'accepted answer' –  Jul 07 '11 at 16:18
  • @Paul - true sportsmanship! [Although I don't think he can unless you edit your answer first; I believe there is a time limit]. Unfortunately I ran out of votes for today... I will +1 your answer tomorrow though! :) – mac Jul 07 '11 at 16:22
  • @mac: by similar match I mean words with the same root, like "ice" - "icecream", or "icecream" - "strawberryicecream", and I am talking about millions of words in each file. – Mephian Jul 07 '11 at 16:24
  • @Mephian - If same root means "one is a substring of the other", that is easy and probably faster to solve bypassing difflib. But if you mean that `iced` and `icecream` need to be hits, then difflib is a better choice, but you will have to look at `get_opcodes()` and it will be **very** slow, as stuff like `ratio()` or `quick_ratio()` will report false positives like `cream` and `stream`... – mac Jul 07 '11 at 16:33
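A quick check of the false positives mentioned in the last comment, assuming the conventional 0.6 cutoff:

```python
from difflib import SequenceMatcher

# 'cream' and 'stream' share the block 'ream', so ratio() scores them as close...
false_positive = SequenceMatcher(None, "cream", "stream").ratio()
# ...while 'ice' vs 'icecream' falls below the cutoff despite sharing a root.
missed = SequenceMatcher(None, "ice", "icecream").ratio()

print(false_positive, missed)  # roughly 0.73 and 0.55 respectively
```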

If all you want is to extract from the files the words where one is a substring of the other (including words that are identical), you could do:

fone = set(['apple', 'orange', 'ice', 'icecream'])
ftwo = set(['apple', 'pear', 'ice'])
# converting to sets avoids checking the same combination twice

result = []
for wone in fone:
    for wtwo in ftwo:
        if wtwo in wone or wone in wtwo:  # substring match in either direction
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)

Alternatively, if you want a similarity measure based on the order of the letters in the strings, you could use, as Paul suggested in his answer, the `SequenceMatcher` class provided by difflib:

import difflib as dl

fone = set(['apple', 'orange', 'ice', 'icecream'])
ftwo = set(['apple', 'pear', 'ice'])

result = []
for wone in fone:
    for wtwo in ftwo:
        s = dl.SequenceMatcher(None, wone, wtwo)
        if s.ratio() > 0.6:  # 0.6 is the conventional threshold for "close matches"
            result.append(wone)
            result.append(wtwo)
for w in set(result):
    print(w)

I did not time either of the two samples, but I would guess the second runs much slower, as for each pair you have to instantiate an object...

mac
  • It would be more efficient if you replaced the nested for-loop with `itertools.product`: `for wone, wtwo in itertools.product(fone, ftwo):` –  Jul 07 '11 at 16:03
  • @Paul - True, I thought of that myself, but I am under the impression (though I did not profile the code) that 95% of the time will be spent on the `str.find()`'s rather than in the looping, so I considered it a *premature optimisation* (=futile). I might be wrong though; the only way to know is to time them! :) – mac Jul 07 '11 at 16:12
  • Yeah, I don't usually fall into the premature optimization trap, but nested loops have always (_always_) been a pet peeve of mine, ever since I learned C++, so when I took up Python and found `itertools.product`, I started reaching for it instead of nested loops. –  Jul 07 '11 at 16:17
  • As to the problem of instantiating the `SequenceMatcher` object every time through the loop, I am curious whether this would be any more efficient: http://codepad.org/FFe7kM8L. I am setting up some scripts to test this right now. –  Jul 07 '11 at 16:23
  • If you're still interested, I have the test code and results at http://pastebin.com/G0fp25qu. Long story short, updating a single object is about 20 seconds faster after 1000000 iterations, which isn't too surprising. What is surprising (to me) is that using `itertools.product` isn't faster at all :-/ –  Jul 07 '11 at 17:20
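The single-object variant discussed in the last two comments can be sketched like this; per the difflib docs, `SequenceMatcher` caches information about its second sequence, so keeping seq2 fixed in the outer loop and swapping only seq1 avoids redoing that work (variable names are illustrative):

```python
import difflib

fone = {'apple', 'orange', 'ice', 'icecream'}
ftwo = {'apple', 'pear', 'ice'}

matcher = difflib.SequenceMatcher(None)
result = set()
for wtwo in ftwo:
    # set_seq2 caches per-sequence data, so set it in the outer loop...
    matcher.set_seq2(wtwo)
    for wone in fone:
        # ...and only swap the uncached first sequence in the inner loop.
        matcher.set_seq1(wone)
        if matcher.ratio() > 0.6:
            result.update((wone, wtwo))

print(sorted(result))
```

Note that with the 0.6 cutoff this still reports only the exact matches from the sample data, as `ice`/`icecream` scores below the threshold.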