Unix command to find string set intersections or outliers?

Question

Is there a UNIX command on par with

sort | uniq

to find string set intersections or "outliers"?

An example application: I have a list of HTML templates. Some of them contain the string {% load i18n %}, others don't. I want to know which files don't.

Edit: grep -L solves the above problem.

How about this:

file1:

mom
dad
bob

file2:

dad

%intersect file1 file2

dad

%left-unique file1 file2

mom
bob

Dale Hagglund · Accepted Answer · 2010-04-23T07:25:26.190

39

It appears that grep -L solves the real problem of the poster, but for the actual question asked, finding the intersection of two sets of strings, you might want to look into the "comm" command. For example, if file1 and file2 each contain a sorted list of words, one word per line, then

$ comm -12 file1 file2

will produce the words common to both files. More generally, given sorted input files file1 and file2, the command

$ comm file1 file2

produces three columns of output:

1. lines only in file1
2. lines only in file2
3. lines in both file1 and file2

You can suppress the column N in the output with the -N option. So, the command above, comm -12 file1 file2, suppresses columns 1 and 2, leaving only the words common to both files.
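As a minimal sketch of the above, applied to the question's sample data: the scratch directory and the file1.sorted / file2.sorted names are assumptions made for illustration. The files are sorted first because comm expects sorted input, and -23 is used to keep only column 1, which corresponds to the "left-unique" case in the question.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Sample data from the question, deliberately left unsorted.
printf '%s\n' mom dad bob > file1
printf '%s\n' dad > file2

# comm expects sorted input, so sort each file first.
sort file1 > file1.sorted
sort file2 > file2.sorted

# Intersection: suppress columns 1 and 2, keeping column 3.
intersect=$(comm -12 file1.sorted file2.sorted)

# Left-unique: suppress columns 2 and 3, keeping column 1.
left_unique=$(comm -23 file1.sorted file2.sorted)

printf 'intersect:\n%s\n' "$intersect"
printf 'left-unique:\n%s\n' "$left_unique"
```

The output of comm follows the sorted order of the input, so the left-unique lines come out as bob, then mom.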

edited Apr 23 '10 at 07:25

answered Jun 19 '09 at 05:07

Dale Hagglund


3

Don't forget to run your files through sort before the comparison. I did, and the results sent me on a wild goose chase. – I. J. Kennedy Apr 23 '10 at 05:43
1

I do mention above that each file contains a "sorted list of words", but it might not jump right out at you. – Dale Hagglund Apr 23 '10 at 07:23

score 9 · Answer 2 · answered Jun 19 '09 at 04:27


Intersect:

$ sort file1 file2 | uniq -d
dad

Left unique:

$ sort file1 file2 | uniq -u
bob
mom

answered Jun 19 '09 at 04:27

John Kugelman


1

The intersect works, but left unique does not. It shows the unique values across the whole set, not those uniquely in the first. – Aaron McMillin Feb 19 '18 at 16:24
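One possible way to get a true left-unique result while staying with sort and uniq is to list file2 twice. This is a sketch, not part of the original answer, and it assumes file1 contains no duplicate lines of its own; the scratch directory is only there to make the example self-contained.

```shell
# Work in a scratch directory with the sample data from the question.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' mom dad bob > file1
printf '%s\n' dad > file2

# Naming file2 twice means every file2 line occurs at least twice in the
# sorted stream, so uniq -u can only print lines that came from file1 alone.
# Assumption: file1 has no repeated lines (a repeated line would be dropped).
left_unique=$(sort file1 file2 file2 | uniq -u)

printf '%s\n' "$left_unique"
```

The same caveat applies to the uniq -d intersection: a line repeated within a single file is reported even if the other file does not contain it.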

score 7 · Answer 3 · answered Aug 20 '12 at 05:13

From http://www.commandlinefu.com/commands/view/5710/intersection-between-two-files:

Intersection between two (unsorted) files:

grep -Fx -f file1 file2

Lines in file2 that are not in file1:

grep -Fxv -f file1 file2

Explanation:

The -f option tells grep to read the patterns to look for from a file. That means that it performs a search of file2 for each line in file1.
The -F option tells grep to see the search terms as fixed strings, and not as patterns, so that a.c will only match a.c and not abc.
The -x option tells grep to do whole line searches, so that "foo" in file1 won't match "foobar" in file2.
By default, grep will show only the matching lines, giving you the intersection. The -v option tells grep to only show non-matching lines, giving you the lines that are unique to file2.
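Applied to the question's sample files, the same options can be sketched as follows. The argument order is swapped relative to the commands above (patterns come from file2, file1 is searched) so that the second command yields the lines unique to file1; the scratch directory is an assumption for illustration.

```shell
# Work in a scratch directory with the sample data from the question.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' mom dad bob > file1
printf '%s\n' dad > file2

# Lines of file1 that also appear in file2 (intersection).
intersect=$(grep -Fx -f file2 file1)

# Lines of file1 that do not appear in file2 (left-unique).
left_unique=$(grep -Fxv -f file2 file1)

printf 'intersect:\n%s\n' "$intersect"
printf 'left-unique:\n%s\n' "$left_unique"
```

Unlike comm, no sorting is needed, and the output keeps the order of the searched file (here mom, then bob).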

score 5 · Answer 4 · answered Jun 19 '09 at 03:40


Maybe I'm misunderstanding the question, but why not just use grep to look for the string (use the -L option to have it print the names of files that don't have the string in them)?

In other words

grep -L "{% load i18n %}" file1 file2 file3 ... etc

or with wildcards for the file names as appropriate.

answered Jun 19 '09 at 03:40

Tyler McHenry


1

For faster searching, I'd use -F too, since it's just a fixed string. – C. K. Young Jun 19 '09 at 03:42
What about set intersections? – Evgeny Jun 19 '09 at 03:44

score 2 · Answer 5 · answered Jun 19 '09 at 03:46

From man grep:

-L, --files-without-match

Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.

So if your templates are .html files you want:

grep -L '{% load i18n %}' *.html
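A small self-contained check of this command might look like the sketch below. The two template files and their names are hypothetical, created only for the demonstration; -F is added because the pattern is a fixed string rather than a regular expression.

```shell
# Work in a scratch directory so the glob only sees the demo templates.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Two hypothetical templates: only with_i18n.html loads the i18n tags.
printf '%s\n' '{% load i18n %}' '<p>{% trans "Hello" %}</p>' > with_i18n.html
printf '%s\n' '<p>Hello</p>' > without_i18n.html

# -L prints the names of files with no matching line;
# -F treats the pattern as a fixed string.
missing=$(grep -L -F '{% load i18n %}' *.html)

printf '%s\n' "$missing"
```

Only without_i18n.html is listed, since it is the one file that lacks the load line.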

score 2 · Answer 6 · answered May 22 '18 at 12:34

Intersection:

comm -12 <(sort -u file1) <(sort -u file2)

All lines by 3 columns (file1 | file2 | intersection):

comm <(sort -u file1) <(sort -u file2)

If your files are not sorted, or if some lines are duplicated within one file but absent from the other, this one-liner sorts both files, removes the duplicate lines, and gives you the desired result directly.
