
Let's say I have an active/ directory and an archive/ directory that contain these files:

active/
foo.bar.abc
foo.bar.xyz
foo.bat.abc

archive/
foo.bat.xyz

I want to write a command that outputs only the filenames in active/ that are unique (uniqueness is based on the middle item) AND that don't match any file already in archive/ (again based on that middle term).

Sample output:

foo.bar.abc

Explanation: it doesn't matter whether foo.bar.abc or foo.bar.xyz is output. foo.bat.abc must not be output, since foo.bat.xyz exists in archive/.

I've found this to help identify unique values based on a pattern, but I can't figure out how to combine that with my additional requirement of no match in archive/.

JitterbugChew
  • Any attempts yourself? – 123 Dec 07 '16 at 15:39
  • Something like `ls | awk -v re='foo\.[[:alpha:]]+\.' 'match($0, re, a) && !(a[0] in p) {p[a[0]]; print}'` prints out the unique file names for a single directory. I'm not sure where to start comparing that to another directory's contents. – JitterbugChew Dec 07 '16 at 15:45
  • @JitterbugChew: Are all the files always of type `word1.word2.word3`, with your requirement being uniqueness from `word2`? – Inian Dec 07 '16 at 16:45
  • @Inian Yes. `word1` is consistent, `word2` and `word3` vary – JitterbugChew Dec 07 '16 at 16:49

2 Answers


Awk is actually not needed here; you can do it with plain grep, sed and sort:

(ls ./archive | sed 's/^/1 /'; ls ./active | sed 's/^/2 /') | \
  sort --field-separator="." --key="2,2" --uniq --stable | \
  grep '^2 ' | sed 's/^2 //'

Explanation:

First, list both directories and mark which directory each line comes from. Then sort both listings together by their middle parts. The option --field-separator="." splits each line into fields on dots, and --key="2,2" tells sort to compare only the second field, i.e. the part between the dots. A stable sort (--stable) keeps lines with equal keys in their input order, so lines from archive come first, and --uniq tells sort to print only the first line of each run of equal keys.

Finally, we keep only the lines that we marked with 2, i.e. the lines from ./active, and strip the marker again.
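The three stages described above can be traced one at a time. This is only a sketch: it recreates the question's files in a scratch directory made with mktemp, and the variable names (tmp, tagged, deduped, result) are illustrative, not part of the original command. It uses the short forms of the same sort options and assumes GNU sort.

```shell
# Recreate the question's layout in a throwaway directory (assumption: mktemp -d is available).
tmp=$(mktemp -d)
mkdir "$tmp/active" "$tmp/archive"
touch "$tmp/active/foo.bar.abc" "$tmp/active/foo.bar.xyz" "$tmp/active/foo.bat.abc"
touch "$tmp/archive/foo.bat.xyz"

# Stage 1: tag every name with its origin (1 = archive, 2 = active).
tagged=$( (ls "$tmp/archive" | sed 's/^/1 /'; ls "$tmp/active" | sed 's/^/2 /') )

# Stage 2: keep the first line per middle field; -s preserves input order
# among equal keys, so an archive line wins over an active line.
deduped=$(printf '%s\n' "$tagged" | LC_ALL=C sort -t. -k2,2 -u -s)

# Stage 3: keep only lines that came from active/ and drop the tag.
result=$(printf '%s\n' "$deduped" | sed -n 's/^2 //p')

printf '%s\n' "$result"   # prints foo.bar.abc
rm -rf "$tmp"
```

After stage 2 the surviving lines are "2 foo.bar.abc" and "1 foo.bat.xyz"; stage 3 discards the archive line, which is why foo.bat.abc never appears.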

Example:

active/
  foo.aaa.xxx
  foo.bar.abc
  foo.bar.xyz
  foo.bat.abc
  zoo.aaa.xxx
  zoo.bbb.aaa


archive/
  aaa.bbb.zoo
  foo.bat.xyz

Result:
  foo.aaa.xxx
  foo.bar.abc
martin.macko.47
  • This works great. I wouldn't have thought to use sed to label each directory's contents. I guess if, for example, the separators were both `;` and `.`, you could do a sed to make them the same before processing it with sort? – JitterbugChew Dec 07 '16 at 17:14
  • Yes, you could preprocess the input before sorting. – martin.macko.47 Dec 07 '16 at 17:29
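A rough sketch of the preprocessing idea from the comments, assuming names may mix `;` and `.` as separators. The sample names are made up for illustration, and the short sort options assume GNU sort. Note one side effect: the names that come out are the normalized ones, not the originals.

```shell
# Hypothetical tagged listing (1 = archive, 2 = active) with mixed separators.
names=$(printf '%s\n' '1 foo;bat.xyz' '2 foo.bar;abc' '2 foo.bat.abc')

# Turn every ';' into '.' so sort sees a single field separator.
normalized=$(printf '%s\n' "$names" | sed 's/;/./g')

# Same sort and filter as in the answer, applied to the normalized lines.
result=$(printf '%s\n' "$normalized" | LC_ALL=C sort -t. -k2,2 -u -s | sed -n 's/^2 //p')

printf '%s\n' "$result"   # prints foo.bar.abc (the original name was foo.bar;abc)
```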

Another attempt, using GNU grep, awk and GNU findutils:

$ grep -Fxvf <(find active/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++') <(find archive/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++')
foo.bar.xyz

I am using process substitution, <(), to run the find/awk commands and pass their output to grep, which finds the difference.

The find command lists the files in the specified directory, one entry per line, and awk filters that list, keeping only the first entry for each distinct second word. With . as the delimiter, !seen[$2]++ prints a line only if its second field has not been seen before, and records the field in the array as it goes.
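The `!seen[$2]++` idiom can be tried on its own, without find. A small sketch using the question's active/ names as literal input (the variable names are illustrative):

```shell
# The three names from active/ in the question, one per line.
names=$(printf '%s\n' foo.bar.abc foo.bar.xyz foo.bat.abc)

# seen[$2]++ is 0 (false) the first time a middle field appears, so the
# negation is true and awk prints the line; later repeats are skipped.
unique=$(printf '%s\n' "$names" | awk -F'.' '!seen[$2]++')

printf '%s\n' "$unique"
# foo.bar.abc
# foo.bat.abc
```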

Do remember that -printf '%P\n' in find is NOT POSIX compatible; it works with GNU findutils. I recommend upgrading to it for this to work.

Other possible solutions with similar logic, one with join and one with comm (both part of GNU coreutils), are below:

$ join -v 2 <(find active/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++') <(find archive/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++')
foo.bar.xyz

Another with comm:

$ comm -13 <(find active/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++') <(find archive/ -type f -printf '%P\n' | awk -F'.' '!seen[$2]++')
foo.bar.xyz
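Note that grep -Fx, join and comm, as invoked above, compare whole filenames rather than the middle term. A minimal sketch that keys the comparison on the middle field instead, using one awk pass over both listings. The scratch directory, the list files and the src variable are assumptions made for illustration; temporary files stand in for process substitution so the sketch also runs in a plain POSIX shell.

```shell
# Recreate the question's layout in a throwaway directory.
tmp=$(mktemp -d)
mkdir "$tmp/active" "$tmp/archive"
touch "$tmp/active/foo.bar.abc" "$tmp/active/foo.bar.xyz" "$tmp/active/foo.bat.abc"
touch "$tmp/archive/foo.bat.xyz"

ls "$tmp/archive" > "$tmp/archive.list"
ls "$tmp/active"  > "$tmp/active.list"

# src is set on the command line before each file, so awk knows which
# listing it is reading even when the archive listing is empty.
result=$(awk -F'.' '
  src == "archive" { seen[$2]; next }   # remember archived middle fields
  !($2 in seen)    { seen[$2]; print }  # first active name per new middle field
' src=archive "$tmp/archive.list" src=active "$tmp/active.list")

printf '%s\n' "$result"   # prints foo.bar.abc
rm -rf "$tmp"
```

Here foo.bar.xyz is skipped because bar was already printed once, and foo.bat.abc is skipped because bat was recorded from archive/.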
Inian
  • Maybe I copypasted something wrongly, but all your solutions seem to return `foo.bat.xyz`, not `foo.bar.xyz`. – martin.macko.47 Dec 07 '16 at 17:19
  • @martin.macko.47: What? All of them are returning `foo.bar.xyz` for me. Can you check the tool versions you are using? They need to be `GNU findutils`, and the others need to be part of `GNU coreutils`. I specifically added a note that my solutions worked under those tools – Inian Dec 07 '16 at 17:22
  • `$ find --version` `find (GNU findutils) 4.4.2` Is it a wrong version? Imho the output of `find` is ok. Are you sure you have `foo.bat.xyz` in your `archive/` and not the other one? – martin.macko.47 Dec 07 '16 at 17:25
  • @martin.macko.47: Agreed. Can you make sure that `grep` and the other tools are also the same, and that you are invoking the command at the root level, with both the folders `active` and `archive` under it? – Inian Dec 07 '16 at 17:26