
I have been trying to make the scripts I write simpler and simpler.

There are numerous ways to get the word count of all files in a folder, or even of all files in the subdirectories of a folder.

For instance, I could write

wc */* 

and I might get output like this (this is the desired output):

   0        0        0 10.53400000/YRI.GS000018623.NONSENSE.vcf
   0        0        0 10.53400000/YRI.GS000018623.NONSTOP.vcf
   0        0        0 10.53400000/YRI.GS000018623.PFAM.vcf
   0        0        0 10.53400000/YRI.GS000018623.SPAN.vcf
   0        0        0 10.53400000/YRI.GS000018623.SVLEN.vcf
   2       20      624 10.53400000/YRI.GS000018623.SVTYPE.vcf
   2       20      676 10.53400000/YRI.GS000018623.SYNONYMOUS.vcf
  13      130     4435 10.53400000/YRI.GS000018623.TSS-UPSTREAM.vcf
 425     4250   126381 10.53400000/YRI.GS000018623.UNKNOWN-INC.vcf

but if there are too many files, I might get an error message like this:

-bash: /usr/bin/wc: Argument list too long

so, I could make a variable and do one folder at a time, like so:

while read FOLDER
do
    wc "$FOLDER"/* >> outfile.txt
done < "$FOLDER_LIST"

so this goes from one line to five just like that.

Further, in one case, I want to use grep -v first, then carry out the word counting, like so:

grep -v dbsnp */* | wc

but this would suffer from two errors:

  1. Argument list too long
  2. If it were not too long, it would give the wc for all of the files at once, not per file.

So, to recap, I would love to be able to do this:

grep -v dbsnp */* | wc > Outfile.txt
awk '{print $4,$1}' Outfile.txt > Outfile.summary.txt

and have it return output like I showed above.

Is there a very simple way to do this, or am I looking at a loop at minimum? Again, like the rest of us, I know 101 ways to do this with a 4-10 line script, but I would love to be able to just type two one-liners at the command prompt, and my knowledge of the shell is not yet deep enough to know which approaches would let me ask that of the OS.

EDIT -

A solution was proposed:

find -exec grep -v dbsnp {} \; | xargs -n 1 wc

This solution leads to the following output:

wc: 1|0:53458644:AMBIGUOUS:CCAGGGC|-16&GCCAGGGCCAGGGC|-18&GCCAGGGCC|-19&GGCCAGGGC|-19&GCCAGGGCG|-19,.:48:48,48:4,4:0,17:-48,0,-48:0,0,-17:27:3,24:24: No such file or directory
wc: 10: No such file or directory
wc: 53460829: No such file or directory
wc: .: Is a directory
      0       0       0 .
wc: AA: No such file or directory
wc: CT: No such file or directory
wc: .: Is a directory
      0       0       0 .
wc: .: Is a directory
      0       0       0 .

As nearly as I can tell, it appears to be treating each line as a file name. I am still reviewing the other answers, and thanks for your help.
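To see why (a toy demonstration, not my real data): xargs splits its input on whitespace and newlines, so every token coming out of grep is handed to the next command as if it were a file name.

```shell
# xargs splits stdin on whitespace and newlines, so each token
# becomes a separate argument; every word is echoed on its own line,
# just as every word of the grep output above was handed to wc.
printf 'alpha beta gamma\n' | xargs -n 1 echo
```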


4 Answers


You mentioned that "this does not solve the problem of returning the wc in an item-by-item fashion"

The following will:

find -exec wc {} \;

But this won't apply your "grep -v" filter.

If you intend to do the same as indicated by my comment on this answer, then please check if the following works for you:

find -exec bash -c "echo -n {}; grep -v dbsnp {} | wc" \;
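A slightly safer variant of the same idea (a sketch with made-up fixture data, not tested on the original VCF files) passes each file name to bash as a positional argument rather than splicing {} into the shell string, which avoids surprises with unusual file names:

```shell
# Build a tiny fixture so the command can be shown end to end.
mkdir -p demo_a/sub
printf 'dbsnp line\nkeep me\nkeep me too\n' > demo_a/sub/a.vcf

# For each regular file, print its name ($1), then the wc of the
# lines that do not contain "dbsnp".
find demo_a -type f -exec bash -c 'printf "%s " "$1"; grep -v dbsnp "$1" | wc' _ {} \;
```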
PradyJord
  • @Vincent I could not understand the purpose of using `grep -v`. If you can elaborate a little on that, maybe we can try to build a solution around it. The way you are using `grep -v`, it will exclude all the lines which contain `dbsnp` and count the remaining words. Or do you just want to exclude dbsnp from the word count? – PradyJord Jun 05 '14 at 07:29
  • I would like to exclude the whole line, and I should add that I am actually aiming to use wc -l. This worked and had the intended effect (all the other answers actually did fail - thank you!!!) – Vincent Laufer Jun 05 '14 at 08:27
  • please check the 2nd find – PradyJord Jun 05 '14 at 08:29

You have too many matches for `*/*`, so grep receives an argument list that is too long. You can use find to circumvent this:

find -exec grep -v dbsnp {} \; | wc

and perhaps you want to get rid of possible traversal errors too:

find -exec grep -v dbsnp {} \; 2> /dev/null | wc
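Alternatively (a sketch with fixture data, assuming the errors come from find handing directories to grep), restricting the traversal to regular files with -type f avoids those errors at the source instead of discarding them:

```shell
# Fixture files, one containing the pattern to be filtered out.
mkdir -p demo_b
printf 'dbsnp one\nkeep one\n' > demo_b/a.txt
printf 'keep two\n' > demo_b/b.txt

# -type f makes find pass grep only regular files, so the
# "Is a directory" errors never occur; the pipeline prints the
# total number of non-matching lines across all files.
find demo_b -type f -exec grep -v dbsnp {} \; | wc -l
```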
perreal
  • This is very interesting. Why does wc error, but find does not? How can I go about learning things like this, which you apparently know, but I don't? I do not see this info on the man page for wc. Also, this does not solve the problem of returning the wc in an item-by-item fashion; rather it returns only the total. – Vincent Laufer Jun 05 '14 at 06:20
  • @VincentLaufer You will want to read about `ARG_MAX` [here](http://www.in-ulm.de/~mascheck/various/argmax/). `find -exec` is designed to work around this by aggregating into sets which fit into `ARG_MAX` (see [here](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/find.html): *"The size of any set of two or more pathnames shall be limited such that execution of the utility does not cause the system's {ARG_MAX} limit to be exceeded."*. – Adrian Frühwirth Jun 05 '14 at 07:00

This works for me:

grep -or "[a-zA-Z]*" * | cut -d":" -f2 | sort | uniq -c

What you're looking for is the MapReduce algorithm: http://en.wikipedia.org/wiki/MapReduce
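For illustration (toy files, GNU grep assumed), the pipeline above produces a per-word frequency table across all files, which is word frequency rather than a per-file count:

```shell
# Two toy files to run the frequency pipeline on.
mkdir -p demo_c
printf 'foo bar\n' > demo_c/a.txt
printf 'foo\n' > demo_c/b.txt

# -o prints each match on its own line; with several files grep
# prefixes "filename:", which cut strips; sort | uniq -c then
# tallies how often each word occurs across all files.
grep -or "[a-zA-Z]*" demo_c/* | cut -d":" -f2 | sort | uniq -c
```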

nervosol

Based on perreal's answer:

If you want the wc file by file, you could use xargs:

find -exec grep -v dbsnp {} \; | xargs -n 1 wc

xargs reads standard input and builds and executes command lines from it. So it reads your input stream and executes wc for each single item (-n 1).
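For comparison, a sketch (fixture data, GNU grep assumed, no whitespace in the paths) that does give per-file counts feeds xargs the file names rather than the file contents; the dbsnp filter folds into grep -c itself:

```shell
# One fixture file; the dbsnp line should be excluded from the count.
mkdir -p demo_d
printf 'dbsnp x\nkeep 1\nkeep 2\n' > demo_d/a.txt

# Feed file *names* to xargs, then count non-matching lines per file:
# -c prints the count, -v inverts the match, -H adds the "filename:" prefix.
find demo_d -type f | xargs -n 1 grep -cHv dbsnp
```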

Stefan Winkler
  • Your second example is just as much subject to `ARG_MAX` as OP's `wc */*` is, so it won't work either if the glob expansion is too large. – Adrian Frühwirth Jun 05 '14 at 07:03