
I need to compare two strings, or at least find out whether a sequence of characters from one string appears in the other. Both hold MD5 hashes of files, and I have to report whether there is a match.

my current code is:

def comparemd5():
    origmd5=getreferrerurl()
    dlmd5=md5_for_file(file_name)
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print "ratio is:",s.ratio()

the output i get is:

original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40
12db46', '59739CCDA2F15D5AC16DB6695CAE3378']

downloader file md5 is 59739ccda2f15d5ac16db6695cae3378

ratio is : 0.0

So there is a match for dlmd5 in origmd5, but somehow it's not being found. I'm doing something wrong somewhere. Please help me out :/

Chris Tang
scandalous
  • When matching MD5 hashes, is it important HOW far off the hash is? If the hashes don't match, the files don't match? – dm03514 Mar 14 '13 at 18:22
  • dlmd5 is list not string for comparing. – iMom0 Mar 14 '13 at 18:23
  • maybe i don't know how this works, but why can't you just do `if dlmd5 in origmd5` – Hoopdady Mar 14 '13 at 18:24
  • dlmd5 is a list? It does appear that way when its printed. – Hoopdady Mar 14 '13 at 18:25
  • Two files with very little difference yield VERY different md5 hashes. – MGP Mar 14 '13 at 18:25
  • Oh, it's uppercase; that's why it's not matching. – Hoopdady Mar 14 '13 at 18:26
  • I need to make python compare the two at least and say that it "found" one match in the original md5 list...that origmd5 CONTAINS dlmd5 as it is searching for the md5 on a website... – scandalous Mar 14 '13 at 18:26
  • It looks like what you actually have is two strings that each contain an MD5 in slightly different formats, one of which also contains something else after the MD5, and you want to know if the MD5 they contain is the same, right? – abarnert Mar 14 '13 at 18:27

3 Answers


Basically, you want the idiom `if test_string in list_of_strings`. It looks like you don't need case sensitivity, so you might want

if test_string.lower() in (s.lower() for s in list_of_strings)

In your case:

>>> originals = ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
>>> test = '59739ccda2f15d5ac16db6695cae3378'
>>> if test.lower() in (s.lower() for s in originals):
...    print '%s is match, yeih!' % test
... 
59739ccda2f15d5ac16db6695cae3378 is match, yeih!
Dr. Jan-Philip Gehrcke

Looks like you're having a problem since the case isn't matching on the letters. May want to try:

def comparemd5():
    origmd5=[item.lower() for item in getreferrerurl()]
    dlmd5=md5_for_file(file_name)
    print "original md5 is",origmd5
    print "downloader file md5 is",dlmd5
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print "ratio is:",s.ratio()
Hoopdady
  • hmmm seems like it's still not finding any match, ratio still 0 ...maybe because its part of a list ? and comparing only with the first item its getting ? – scandalous Mar 14 '13 at 19:08
  • i found some way to workaround it Hoopdady, i merged the origmd5 into a single string using join and then compared. now it works! thanks for the heads up you helped me greatly ! – scandalous Mar 15 '13 at 08:47
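
The ratio stays at 0.0 because difflib.SequenceMatcher compares its two sequences element by element: the elements of the list are whole strings, while the elements of dlmd5 are single characters, so none of them are ever equal. Below is a minimal Python 3 sketch of the join workaround mentioned in the comment above. The exact code scandalous used isn't shown, so the "," separator and the substring test are assumptions, and the sample values are adapted from the question's output with the wrapped second entry joined back together.

```python
import difflib

# Sample values adapted from the question's output.
origmd5 = ['0430f244a18146a0815aa1dd4012db46',
           '0430f244a18146a0815aa1dd4012db46',
           '59739CCDA2F15D5AC16DB6695CAE3378']
dlmd5 = '59739ccda2f15d5ac16db6695cae3378'

# List vs. string: one side yields whole strings, the other single
# characters, so SequenceMatcher finds nothing in common.
list_ratio = difflib.SequenceMatcher(None, origmd5, dlmd5).ratio()

# Join the list into one lowercase string. The "," separator is an
# assumption; it keeps a match from straddling two neighbouring hashes.
joined = ",".join(origmd5).lower()
found = dlmd5.lower() in joined

print(list_ratio, found)
```

Once the list is a single string, a plain substring test is enough, and no similarity ratio is needed.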

Given the input:

original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']

downloader file md5 is 59739ccda2f15d5ac16db6695cae3378

You have two problems.

First of all, origmd5 isn't just an MD5; it's a list holding the MD5 plus two other values.

To fix that: If you know that origmd5 will always be in this format, just use origmd5[2] instead of origmd5. If you have no idea what origmd5 is, except that one of the things in it is the actual MD5, you'll have to compare against all of the elements.

Second, the actual MD5 values are both hex strings representing the same binary data, but they're different hex strings (because one is in uppercase, the other in lowercase). You could fix this by just doing a case-insensitive comparison, but it's probably more robust to unhexlify them both and compare the binary values.

In fact, if you've copied and pasted the output correctly, at least one of those hex strings has a space in the middle of it, so you actually need to unhexlify hex strings with optional spaces between hex pairs. AFAIK, there is no stdlib function that does this, but you can write it yourself in one step:

import binascii

def unhexlify(s):
    return binascii.unhexlify(s.replace(' ', ''))  # drop spaces, then decode the hex

Meanwhile, I'm not sure why you're trying to use difflib.SequenceMatcher at all. Two slightly different MD5 hashes refer to completely different original sources; that's kind of the whole point of MD5, and crypto hash functions in general. There's no such thing as a 95% match; there's either a match, or a non-match.
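
To illustrate the point, here is a small Python 3 sketch using hashlib. The two input byte strings are made up for the example. Changing a single character gives an unrelated digest, so a similarity ratio between digests says nothing about how similar the inputs were.

```python
import difflib
import hashlib

# Two made-up inputs that differ by a single character.
data_a = b"hello world"
data_b = b"hello World"

digest_a = hashlib.md5(data_a).hexdigest()
digest_b = hashlib.md5(data_b).hexdigest()

# Similarity of the inputs vs. similarity of their digests.
input_ratio = difflib.SequenceMatcher(None, data_a, data_b).ratio()
digest_ratio = difflib.SequenceMatcher(None, digest_a, digest_b).ratio()

print(digest_a)
print(digest_b)
print("inputs:", input_ratio, "digests:", digest_ratio)
print("same content?", digest_a == digest_b)
```

The only meaningful question to ask of two digests is the last line: are they equal or not.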

So, if you know the 3rd value in origmd5 is the one you want, just do this:

s = unhexlify(origmd5[2]) == unhexlify(dlmd5)

Otherwise, do this:

s = any(unhexlify(origthingy) == unhexlify(dlmd5) for origthingy in origmd5)

Or, turning it around to make it simpler:

s = unhexlify(dlmd5) in map(unhexlify, origmd5)

Or whatever equivalent you find most readable.
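
Putting the pieces together, a minimal Python 3 sketch might look like the following. The helper names normalize_md5 and md5_matches are hypothetical, and the sample values are taken from the question as pasted, including the space in the second entry.

```python
import binascii

def normalize_md5(hex_string):
    """Turn a hex digest (any case, optional spaces) into raw bytes."""
    return binascii.unhexlify(hex_string.replace(" ", ""))

def md5_matches(candidate, known_hashes):
    """Return True if candidate equals any entry of known_hashes."""
    target = normalize_md5(candidate)
    return any(normalize_md5(h) == target for h in known_hashes)

# Sample values from the question.
origmd5 = ['0430f244a18146a0815aa1dd4012db46',
           '0430f244a18146a0815aa1dd40 12db46',
           '59739CCDA2F15D5AC16DB6695CAE3378']
dlmd5 = '59739ccda2f15d5ac16db6695cae3378'

print(md5_matches(dlmd5, origmd5))
```

Note that binascii.unhexlify raises binascii.Error if an entry is not valid hex, so this sketch assumes every element of the list really is a hex digest.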

abarnert