How To Identify Files with Identical Content But a Different Arrangement of the Data

Question

I'm testing an upgrade we ran on an application that processes data. I took archived data that has already run through the system before and am comparing it with output from the newly upgraded application. I'm noticing that the data is the same, but the arrangement of the data in the new output is different. For example, the data on line 57 of the new file used to be on line 43 of the old output. Is there a way to detect that the files contain identical content? When I run a file compare in TextPad or do an MD5 hash compare, it doesn't detect that the files have the same content. It sees them as different files.

First sort the files, then compare them. – Dominique, Oct 25 '18 at 13:18

Enak · Answer 1 · 2018-10-25T13:25:14.343

A hash comparison of the raw files is meaningless here, because two files containing

foo
bar

and

bar
foo

would generate a completely different hash. Otherwise hash functions would be really broken.

I think your only chance here is to check whether every line in file A is also in file B (line by line). Maybe you could implement a sort algorithm. This could be done concurrently on both files, and then you could compare the hashes of the two sorted files, since the sort algorithm is deterministic in its output.
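To make the sort-then-hash idea concrete, here is a minimal sketch in Python. The function names are made up for illustration, and it assumes the files are UTF-8 text that fits in memory and that differences in line endings (CRLF vs LF, missing final newline) should be ignored.

```python
import hashlib


def sorted_lines_digest(lines):
    """Return a SHA-256 hex digest of the given lines after sorting them."""
    # Strip line endings so CRLF vs LF or a missing final newline is ignored.
    normalized = sorted(line.rstrip("\r\n") for line in lines)
    digest = hashlib.sha256()
    for line in normalized:
        digest.update(line.encode("utf-8"))
        # Separator, so the lines "ab", "c" hash differently from "a", "bc".
        digest.update(b"\n")
    return digest.hexdigest()


def same_content_any_order(path_a, path_b):
    """True if both files hold the same lines, regardless of their order."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return sorted_lines_digest(fa) == sorted_lines_digest(fb)
```

Duplicate lines are kept, so a file with a line repeated twice will not match a file that contains it only once. If you only need a yes/no answer, comparing the two sorted lists directly works just as well; the hash is mainly handy when you want to store or log a fingerprint.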

score 1 · Accepted Answer · answered Oct 25 '18 at 13:33

As Enak and Dominique have mentioned, sorting the text files line by line and then comparing the two will reveal with complete certainty whether anything is missing.

You might instead calculate some aggregate values for both files and compare them as sufficient proof, which will be a lot faster. Is the number of words and characters the same? What about the number of occurrences of each letter? Count all 26 letters of the alphabet in both files (you could do the same for any character set of your choice); if the counts match exactly, there is a very high probability that both files contain the same information. This is along the same lines as your hashing approach, but obviously isn't as reliable.
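A rough sketch of that aggregate check in Python, assuming the file contents have already been read into strings. The names `aggregate_profile` and `profiles_match` are hypothetical, and counting every character (not just the 26 letters) is my own choice here.

```python
from collections import Counter


def aggregate_profile(text):
    """Summarize text as line count, word count, and per-character counts."""
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        # Line-ending characters are skipped so CRLF vs LF does not matter.
        "chars": Counter(ch for ch in text if ch not in "\r\n"),
    }


def profiles_match(text_a, text_b):
    """True if both texts have identical aggregate profiles."""
    return aggregate_profile(text_a) == aggregate_profile(text_b)
```

Keep in mind that a match is only evidence, not proof: `profiles_match("ab\ncd\n", "ad\ncb\n")` is true even though no line is shared, because the same characters were merely moved between lines. A mismatch, on the other hand, does prove the files differ.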

If you need to know with certainty, you will have to compare each line of file A with each line of file B somehow. If the lines are completely shuffled, sorting the lines in files A and B and then comparing the sorted files will be the best option. If there is locality, however (line number x of file A tends to stay around location x in file B), you might as well compare the two files without sorting, starting your search for line x of file A around location x in file B.
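The locality idea could look something like this in Python. It is a sketch under my own assumptions: both files are already loaded as lists of lines with line endings removed, and `match_with_locality` is a made-up name.

```python
def match_with_locality(lines_a, lines_b):
    """Return True if lines_b is a reordering of lines_a.

    For line i of A, B is searched outward from position i
    (i, then i-1 and i+1, then i-2 and i+2, ...) among the
    lines of B that have not been matched yet.
    """
    if len(lines_a) != len(lines_b):
        return False
    n = len(lines_b)
    used = [False] * n  # marks lines of B already paired with a line of A
    for i, line in enumerate(lines_a):
        found = False
        for offset in range(n):
            for j in (i - offset, i + offset):
                if 0 <= j < n and not used[j] and lines_b[j] == line:
                    used[j] = True
                    found = True
                    break
            if found:
                break
        if not found:
            return False  # this line of A has no unmatched partner in B
    return True
```

When lines only move a few positions, each search stops after a handful of steps. If the lines are fully shuffled, this degrades to roughly quadratic time, and sorting both files first is the better choice.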

I chose to go with your value count suggestion. It's not as tight as a hash, but it will do for our testing. I'll couple the script with some manual smoke tests and that should be enough. Thanks for the suggestion. — acecabana, Oct 25 '18 at 16:41
