Comparing many files in Bash

Question

I'm trying to automate a task at work that I normally do by hand: taking database output of the permissions of multiple users and comparing them to see what they have in common. I have a script right now that uses comm and paste, but it's not giving me all the output I'd like.

Part of the problem is that comm only deals with two files at once, and I need to compare at least three to find a trend. I also need to determine whether two out of the three have something in common that the third one doesn't (so comparing the output of two comm commands doesn't work). I need the results as comma-separated values so they can be imported into Excel. Each user has a column, and at the end is a listing of everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).

Apart from the code I have to clean all the extra cruft off the raw CSV file, here's what I have so far for comparing four users. It's highly inefficient, but it's what I know.

cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8

comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2

paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv

Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes it together. This works better than doing it by hand, but I know I could be doing more.

An example input file would be something like:

User1

Active Directory
Internet
S: Drive
Sales Records

User2

Active Directory
Internet
Pricing Lookup
S: Drive

User3

Active Directory
Internet
Novell
Sales Records

where all three have Active Directory and Internet in common, two out of three have Sales Records and S: Drive access, and only one user each has Novell and Pricing Lookup access.

Can someone give me a hand in what I'm missing?

score 1 · Answer 1 · answered May 01 '12 at 13:31

You can use the diff3 program. From the man page:

   diff3 - compare three files line by line

Given your sample inputs, above, running diff3 results in:

====
1:3,4c
  S: Drive
  Sales Records
2:3,4c
  Pricing Lookup
  S: Drive
3:3,4c
  Novell
  Sales Records

Does this get you any closer to what you're looking for?

larsks

I actually wrote up a script using diff3, until I ran into one where the boss said "compare these FOUR users now!" – freehunter May 01 '12 at 19:31

Dennis Williamson · Accepted Answer · 2012-05-01T21:36:42.393

Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.

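#!/usr/bin/awk -f
# Requires GNU AWK (gawk) for asort().
{
    # Record that this file (user) contains this line (permission).
    array[FILENAME, $0] = $0
    perms[$0] = $0
    # Track the longest permission name to size the first column.
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    # asort() sorts by value and reindexes each array from 1 to its count.
    pcount = asort(perms)
    ucount = asort(users)
    maxplen += 2
    colwidth = 8
    # Header row: blank first column, then one column per user.
    printf("%*s", maxplen, "")
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")

    # One row per permission, with an X under each user who has it.
    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if ((users[u], perms[p]) in array) {
                printf("%-*s", colwidth, "  X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}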

Save this file, perhaps calling it "correlate", then set it to be executable:

$ chmod u+x correlate

Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:

$ ./correlate user*

and you would get the following output based on your sample input:

                  user1   user2   user3
Active Directory    X       X       X
Internet            X       X       X
Novell                              X
Pricing Lookup              X
S: Drive            X       X
Sales Records       X               X
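
As a rough illustration of the Bash 4 route mentioned at the top of this answer, here is a minimal sketch using associative arrays. The function name correlate_table and the fixed column width of 8 are assumptions made for the example, and it assumes one permission per line in each file. Rows are sorted with LC_ALL=C so the order does not depend on the locale.

```shell
#!/usr/bin/env bash
# Sketch only: needs Bash 4+ for associative arrays (local -A).
# correlate_table is a hypothetical name; pass one file per user.
correlate_table() {
    local -A has=() perms=()
    local sep=$'\034'            # key separator, same idea as awk's SUBSEP
    local file line perm width=0

    for file in "$@"; do
        while IFS= read -r line || [[ -n $line ]]; do
            [[ -z $line ]] && continue            # skip blank lines
            has["$file$sep$line"]=1               # this user has this permission
            perms["$line"]=1                      # set of distinct permissions
            (( ${#line} > width )) && width=${#line}
        done < "$file"
    done
    (( width += 2 ))

    # Header row: blank first column, then one column per file.
    printf '%*s' "$width" ''
    for file in "$@"; do
        printf '%-8s' "$file"
    done
    printf '\n'

    # One row per permission, sorted by name.
    while IFS= read -r perm; do
        printf '%-*s' "$width" "$perm"
        for file in "$@"; do
            if [[ -n ${has["$file$sep$perm"]:-} ]]; then
                printf '%-8s' '  X'
            else
                printf '%-8s' ''
            fi
        done
        printf '\n'
    done < <(printf '%s\n' "${!perms[@]}" | LC_ALL=C sort)
}

# Usage: correlate_table user1 user2 user3
```

Columns follow the order of the arguments, so no sorting of user names is needed. It will be slower than AWK on large files, since the inner loop is pure Bash.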

Edit:

This version doesn't use asort() and so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable.

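#!/usr/bin/awk -f
{
    # Record that this file (user) contains this line (permission).
    array[FILENAME, $0] = $0
    perms[$0] = $0
    # Track the longest permission name to size the first column.
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    maxplen += 2
    colwidth = 8
    # Header row: blank first column, then one column per user.
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")

    # One row per permission, with an X under each user who has it.
    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if ((u, p) in array) {
                printf("%-*s", colwidth, "  X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}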
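
Since the question asks for comma-separated output for Excel, here is one possible sketch that stays portable (no asort()) but still gives a predictable order. Columns follow the order of the file arguments via ARGV, and rows are sorted by piping through sort. The function name correlate_csv and the "Permission" header label are made up for this example, and it assumes that neither file names nor permission names contain commas or double quotes.

```shell
# Sketch only: CSV table using POSIX awk plus sort.
# correlate_csv is a hypothetical name; pass one file per user.
correlate_csv() {
    # Header row: a label for the first column, then the file names.
    printf 'Permission'
    for f in "$@"; do
        printf ',%s' "$f"
    done
    printf '\n'

    awk '
        # Remember each (file, permission) pair, skipping blank lines.
        $0 != "" { seen[FILENAME, $0] = 1; perms[$0] = 1 }
        END {
            for (p in perms) {
                row = p
                # ARGV[1] .. ARGV[ARGC-1] are the files in command-line order.
                for (i = 1; i < ARGC; i++)
                    row = row "," (((ARGV[i], p) in seen) ? "X" : "")
                print row
            }
        }
    ' "$@" | LC_ALL=C sort -t, -k1,1    # sort rows by permission name
}

# Usage: correlate_csv user1 user2 user3 > output.csv
```

With the sample input from the question this should print:

```
Permission,user1,user2,user3
Active Directory,X,X,X
Internet,X,X,X
Novell,,,X
Pricing Lookup,,X,
S: Drive,X,X,
Sales Records,X,,X
```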

I'll have to try this when I get home. Obviously there's something there in that script but I don't know awk well enough to get it working right. For me it says: awk: ./coorelate: line 35: function asort never defined awk: ./coorelate: line 35: function asort never defined — freehunter, May 01 '12 at 19:59
@freehunter: It means you have a non-GNU AWK (not `gawk`). I will post a version that doesn't require `asort()` and I'll try to avoid other gawkisms. — Dennis Williamson, May 01 '12 at 20:21
Thanks Dennis, this worked perfectly. I guess it's time for me to learn AWK or Perl so I can do this myself. — freehunter, May 02 '12 at 12:21

score 0 · Answer 3 · answered May 01 '12 at 13:34

I would use the strings command to remove any binary data from the files, cat them together, then use uniq -c on the concatenated file to get a count of the occurrences of each string.

Drake Clarris

+1, but you should sort the catted output, before `uniq -c`. Simplest form would be: `sort User? | uniq -c` – Michał Trybus May 01 '12 at 13:59
ah yeah I always forget to mention the sort first as it is near force of habit when using uniq after making so many mistakes in the past of not doing it – Drake Clarris May 01 '12 at 14:09
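
Putting the sort-then-count idea from these comments together, a minimal sketch might look like the following. The function name count_perms is invented for the example. Each file is de-duplicated with sort -u first, on the assumption that a permission repeated within one user's file should only count once.

```shell
# Sketch only: count how many of the given files contain each line.
# count_perms is a hypothetical name; pass one file per user.
count_perms() {
    for f in "$@"; do
        LC_ALL=C sort -u "$f"      # one copy of each line per file
    done | LC_ALL=C sort | uniq -c # uniq -c needs sorted input
}

# Usage:
#   count_perms user1 user2 user3                  # every permission with its count
#   count_perms user1 user2 user3 | awk '$1 == 2'  # held by exactly two users
```

This only tells you how many users hold a permission, not which ones, so it answers the "what do they all have in common" part but not the per-user columns.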
