I have a problem with computing an md5sum. I have a recovery tool that archives each file's metadata (inode), computes the file's md5sum, and stores both in an SQLite database during installation. When a file is removed or deleted, the tool recovers it using the metadata from the SQLite database. I then want to make sure the recovered file is exactly the same as the original, so I recompute the recovered file's md5sum as shown below. The problem is that, strangely, for a few files the content looks exactly the same as before deletion (checked with cat) and the stat command shows the same output (apart from a different inode number), yet the md5sum is different.

The following two files have the same content and produce the same md5sum, so a different inode number by itself does not affect the md5sum.

764efa883dda1e11db47671c4a3bbd9e  /test/hi1.txt
764efa883dda1e11db47671c4a3bbd9e  /test/hi.txt

Any thoughts on how I should proceed with this?

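char file_location[512] = {0};

/* 32 hex digits of the digest plus the terminating NUL */
char md5_cmd[600], md5sum[33] = {0};
FILE *pf;
// some recovery stuff goes here...

// Recompute the md5sum of the recovered file.
// The path is single-quoted so names containing spaces survive the shell;
// a name that itself contains a single quote would still break this.
snprintf(md5_cmd, sizeof md5_cmd, "md5sum '%s'", file_location);

pf = popen(md5_cmd, "r");
if (!pf) {
    fprintf(stderr, "Could not open pipe\n");
    return;
}

// Read the digest; check the result so an empty pipe is not printed as a sum
if (!fgets(md5sum, sizeof md5sum, pf))
    fprintf(stderr, "Error: could not read md5sum output.\n");

// pclose() returns the exit status of md5sum, so non-zero means it failed
if (pclose(pf) != 0)
    fprintf(stderr, "Error: md5sum failed.\n");

fprintf(stdout, "Md5sum is %s\n", md5sum);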
MD XF
webminal.org
  • `,I can see (using cat)` What if there is stuff you cannot see? Control characters, spaces vs. tabs, a newline at the end of one file? Do a hexdump of both files and compare the hex. – nos Nov 11 '11 at 09:03
  • Why is it a problem that files with the exact same content have the same MD5? – Jonas Elfström Nov 11 '11 at 09:04
  • Or just a different encoding. MD5 works on the binary representation. – Vladislav Zorov Nov 11 '11 at 09:06
  • @nos, thanks, that's a good suggestion. I'll try that first. @Jonas, no, the problem is that sometimes files with exactly the same content have different md5sums. @Vladislav, thanks, I'll read more about the actual representation. – webminal.org Nov 11 '11 at 09:09
  • Another (very unpleasant) option is that the system has flaky memory and the md5sum process sometimes hits broken memory locations, causing the "verified" data to wiggle around while the sum is calculated. These things happen! There is /no/ good reason for an md5sum to change on the same input data. Its purpose is not to change. Conversely, two different files may indeed have the same md5sum. That is called a hash collision; it is unlikely but possible, and a natural property of hash functions. – Tilman Vogel Nov 11 '11 at 09:52
  • Thanks, Tilman, for your input. – webminal.org Nov 12 '11 at 03:10

1 Answer

You cannot reliably compare file contents with cat. Unless you use cat -A or similar, many differences can go unnoticed: spaces vs. tabs, whitespace at the end of lines, etc.

You should compare the files with

diff -u fileA fileB

or

cmp fileA fileB
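Since the recovery tool in the question is written in C, the same check that cmp performs could also be done in-process instead of eyeballing cat output. The following is only a minimal sketch: the function name first_difference and its return convention are hypothetical and not part of the original tool. It reads both files in binary mode and reports the offset of the first byte that differs.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the original tool): compare two files
 * byte by byte, roughly what `cmp fileA fileB` does.
 *
 * Returns the 0-based offset of the first differing byte,
 * -1 if the files are identical,
 * -2 if either file could not be opened.
 */
long first_difference(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");   /* binary mode: no newline translation */
    FILE *fb = fopen(path_b, "rb");
    long offset = 0;
    long result = -1;
    int ca, cb;

    if (!fa || !fb) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -2;
    }

    for (;;) {
        ca = fgetc(fa);
        cb = fgetc(fb);
        if (ca != cb) {        /* also true when one file ends before the other */
            result = offset;
            break;
        }
        if (ca == EOF)         /* both files ended at the same offset */
            break;
        offset++;
    }

    fclose(fa);
    fclose(fb);
    return result;
}
```

Calling first_difference on the original and the recovered file and getting anything other than -1 means the contents differ, which is exactly the situation in which md5sum reports different digests. An invisible difference such as a trailing space or a missing final newline shows up here as a concrete byte offset.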

glglgl
  • Thanks, guys. Using hexdump and diff shows that the files are indeed different; the contents are not the same as I thought. – webminal.org Nov 11 '11 at 09:20