1

I need to count, in bash, the number of occurrences of a given (single-byte) character in a file. For example: count the number of commas, dots, uppercase 'C's, or any other character.

Basically, I need a generalized version of wc -l that counts any single-byte character (not just newlines) in a given file.

I have to use it with very large files (several GB), so it has to be fast and resource-efficient. Ideally it should perform as well as wc -l does when counting newlines.

mauro
  • Just write a trivial program in C. It's both trivial and efficient. – 4ae1e1 Jan 02 '16 at 06:38
  • @4ae1e1: agreed... wrote a program in C comparing one word at a time (same approach as `wc`). Updating my results here below... – mauro Jan 02 '16 at 10:22
  • http://stackoverflow.com/questions/1603566/count-occurrences-of-a-char-in-plain-text-file – Gang Jan 02 '16 at 22:16

2 Answers

8

You can use grep -o with wc -l, e.g. to count the occurrences of the letter C in your input file:

grep -Fo 'C' file | wc -l

To get this done in a single command, you can use GNU awk with a custom RS:

awk -v RS='C' 'END{print NR-1}' file
anubhava
3

Posting the results of a few tests here for documentation purposes. I counted the number of dots in a file of 1,807,076,940 bytes and 100 million lines. Each line contains exactly one dot:

$ time wc -l xnorm.dat # takes 1.047 seconds (this counts newlines)
$ time grep -o '\.' xnorm.dat | wc -l # takes 87.443 seconds
$ time awk -v RS='.' 'END{print NR-1}' xnorm.dat # takes 53.947 seconds
$ time tr -d -C '\.' < xnorm.dat | wc -c # takes 3.732 seconds
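For reuse, the tr pipeline above (the fastest of the shell-only options timed here) can be wrapped in a small bash function. This is only a sketch: the name count_char is made up for illustration, it assumes a tr that accepts --, and it does not check that the first argument really is a single byte.

```shell
# count_char CHAR FILE: print how many times the single-byte CHAR occurs in FILE.
# (count_char is a hypothetical helper name, not something from the answer above.)
count_char() {
    local set=$1
    # tr treats a backslash as the start of an escape sequence, so double it.
    [ "$set" = '\' ] && set='\\'
    # LC_ALL=C makes tr operate on bytes. -c complements the set and -d deletes,
    # so only occurrences of CHAR survive; wc -c then counts them.
    LC_ALL=C tr -cd -- "$set" < "$2" | wc -c
}
```

Usage would look like `count_char . xnorm.dat` or `count_char , some.csv`. Characters that are special to the shell (such as a backslash or a space) still need quoting on the command line.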

Edit

I wrote a small program (fcc = fast char counter) in C, as per 4ae1e1's suggestion:

$ time fcc -i xnorm.dat -c \. # takes 1.327 seconds
mauro