I'm trying to compare two files on a Solaris box and see only the lines that are not similar. I know that I can use the command below to find lines that are not exact matches, but that isn't good enough for what I'm trying to do.

comm -3 <(sort FILE1.txt | uniq) <(sort FILE2.txt | uniq) > diff.txt

For the purposes of this question, I would define similar as having the same characters ~80% of the time, while completely ignoring the locations that differ (since the sections that differ may also differ in length). The locations that differ can be assumed to occur at roughly the same point in the line. In other words, once we find a location that differs, we have to figure out where to resume comparing.
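
One way to make this definition concrete is sketched below. It is not a tested solution for the real data, and it uses Python's standard `difflib` rather than bash or Perl purely for illustration. `SequenceMatcher` aligns the matching blocks of two lines and skips over the parts that differ, even when those parts have different lengths, and `ratio()` reports the fraction of characters that matched. The function names and the 0.8 default are assumptions taken from the "~80%" figure above; lines with several long variable fields may need a lower threshold.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score in [0, 1]: 2 * matched characters / total characters."""
    # autojunk=False disables difflib's popularity heuristic, which only
    # applies to long sequences and can distort scores on long log lines.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_similar(a, b, threshold=0.8):
    """Hypothetical helper: treat two lines as similar at or above threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    first = "Exception Invalid account type requested: 134029830198,level.fatal"
    second = "Exception Invalid account type requested: 1307230,level.fatal"
    print(similarity(first, second), is_similar(first, second))
    print(similarity("Only in file 1", second), is_similar("Only in file 1", second))
```

Because the score is computed from aligned blocks, the comparison resumes automatically after each differing section, which is the "when to start comparing again" part of the problem.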

I know this is a hard problem to solve and will appreciate any help/ideas.

EDIT:

Example input 1:

Abend for SP EAOJH with account s03284fjw and client pewaj39023eipofja,level.error
Exception Invalid account type requested: 134029830198,level.fatal
Only in file 1

Example input 2:

Exception Invalid account type requested: 1307230,level.fatal
Abend for SP EREOIWS with account 32192038409aoewj and client eowaji30948209,level.error

Example output:

Only in file 1

I am also realizing that it would be ideal if the files were not read into memory all at once, since they can be nearly 100 GB. Perhaps Perl would be better than bash because of this need.
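
A rough sketch of how the memory constraint might be handled, again in Python for illustration only. It rests on an assumption that is not stated above: that the second file contains a manageable number of distinct line "shapes" even if it has a huge number of lines. Under that assumption, one pass over file 2 keeps a single representative per shape, and file 1 is then streamed line by line against those representatives. All function names here are hypothetical.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """True if the difflib ratio of the two lines reaches the threshold."""
    m = SequenceMatcher(None, a, b, autojunk=False)
    # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
    # so they can reject most non-matching pairs before the expensive call.
    return (m.real_quick_ratio() >= threshold
            and m.quick_ratio() >= threshold
            and m.ratio() >= threshold)


def build_representatives(path, threshold=0.8):
    """Stream a file, keeping one line per group of mutually similar lines."""
    reps = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not any(is_similar(line, r, threshold) for r in reps):
                reps.append(line)
    return reps


def lines_without_match(path, reps, threshold=0.8):
    """Yield lines of a file that are not similar to any representative."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not any(is_similar(line, r, threshold) for r in reps):
                yield line


if __name__ == "__main__":
    reps = build_representatives("FILE2.txt")
    for line in lines_without_match("FILE1.txt", reps):
        print(line)
```

Memory use is bounded by the number of representatives rather than by file size, but run time grows with lines × representatives, so this only helps if the assumption about a small number of shapes holds.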

d-_-b
zwy
  • We need some sample data and expected output. Use the `{}` editing tool at the top left of the input box to keep the format of your data. Good luck. – shellter Jun 21 '13 at 14:41
  • Also consider that there are special algorithms for determining similarity. I don't know how many are available as UNIX tools, so you may have to do some coding. Example: [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) – jim mcnamara Jun 21 '13 at 15:34
  • Looks like I may be able to just use a perl implementation of Levenshtein distance (http://cpansearch.perl.org/src/JGOLDBERG/Text-Levenshtein-0.05/Levenshtein.pm) – zwy Jun 25 '13 at 03:24
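
For reference, the Levenshtein distance mentioned in the comments can be sketched in a few lines. This is a plain Python illustration of the textbook dynamic-programming algorithm, not the code of the Perl module linked above, and `normalized_similarity` is a hypothetical helper that turns the distance into a 0 to 1 score comparable with an ~80% cutoff.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    # Keep the shorter string as the inner dimension so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Hypothetical score: 1.0 for identical strings, 0.0 for nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(normalized_similarity("kitten", "sitting"))
```

Only two rows of the table are held at a time, so memory per comparison is proportional to the shorter line, although the time per pair is still proportional to the product of the two line lengths.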

0 Answers