Count unique words in all text files in a directory, and delete those having fewer than 2?

Question

This gets me the count. But how do I delete the files whose count is less than 2?

$ cat ./a1esso.doc | grep -o -E '\w+' | sort -u -f | wc --words
1
$ cat ./a1brit.doc | grep -o -E '\w+' | sort -u -f | wc --words
4

How can I grab the names of the files that have fewer than 2 unique words, so that they can be deleted? I will be scanning millions of files. A find command can find all the files, but it seems the filename needs to be propagated through the pipeline, so that rm can be used at the right-hand end.

Thanks for reading.

Update:

The correct answer will use an input pipeline to feed filenames. This is not negotiable. The program is not meant for the single input file shown in the example; its input comes from a dynamic list of many files.

The accepted answer will also include a filter that identifies the names of the files meeting the criterion. This is not negotiable either.

Red Cricket · Answer 1 · 2018-11-21T00:38:15.640

You could do this …

 test "$(grep -o -E '\w+' ./a1esso.doc | sort -u -f | wc --words)" -lt 2 && rm ./a1esso.doc

Update: removed useless cat as per David's comment.
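
To apply the same test to a stream of filenames, as the question asks, something along these lines may work. This is only a sketch: the function name few_unique_words is made up for illustration, it assumes one filename per line on standard input (so names containing newlines are not handled), and it assumes a grep that supports -o and \w, as in the question. It counts with wc -l rather than wc --words, which gives the same number here because grep -o prints one match per line.

```shell
# Hypothetical helper (name chosen for illustration): reads filenames from
# stdin, one per line, and prints those with fewer than MIN unique words.
# Assumes filenames contain no newline characters.
few_unique_words() (
    min="${1:-2}"
    while IFS= read -r file; do
        # grep -o prints one word per line; sort -u -f removes duplicates
        # case-insensitively; wc -l counts what is left.
        count=$(grep -o -E '\w+' -- "$file" | sort -u -f | wc -l)
        if [ "$count" -lt "$min" ]; then
            printf '%s\n' "$file"
        fi
    done
)

# Dry run: only list the candidates.
#   find . -type f -name '*.doc' | few_unique_words 2
# Delete them once the list looks right.
#   find . -type f -name '*.doc' | few_unique_words 2 |
#       while IFS= read -r f; do rm -- "$f"; done
```

Keeping the filter separate from the rm step means the list can be inspected before anything is deleted. An empty file counts as 0 unique words, so it is listed as well.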

edited Nov 21 '18 at 00:38

answered Nov 21 '18 at 00:29

`cat ./a1esso.doc` is an *Unnecessary Use Of* `cat` (*UUOc*). Instead, use `grep -o -E '\w+' ./a1esso.doc | ...` – David C. Rankin Nov 21 '18 at 00:37
The answer cannot get chosen as written. The correct answer is going to use cat to feed filenames, as I already showed. This is not negotiable. – Geoffrey Anderson Nov 21 '18 at 03:05
You don't need `cat` to "feed filenames". `grep` takes a filename as an argument. `cat file | grep ...` is equivalent to `grep … file`; it is just that the former is considered bad form. – Red Cricket Nov 21 '18 at 03:19
