Count the number of a given character in a file

Question

I need to count, in bash, the occurrences of a given single-byte character in a file. For example: the number of commas, dots, uppercase 'C', or any other character.

Basically I need a generalized version of wc -l to count any single byte character (not just new lines) contained in a certain file.

I have to use it with very large files (several GB), so it has to be fast and resource efficient. Ideally it should match the performance of wc -l when counting newlines.

Just write a trivial program in C. It's both trivial and efficient. — 4ae1e1, Jan 02 '16 at 06:38
@4ae1e1: agreed... wrote a program in C comparing one word at a time (same approach as `wc`). Updating my results here below... — mauro, Jan 02 '16 at 10:22
http://stackoverflow.com/questions/1603566/count-occurrences-of-a-char-in-plain-text-file — Gang, Jan 02 '16 at 22:16

anubhava · Answer 1 · 2020-08-06T07:13:55.817


You can use grep -o with wc -l, e.g. to count the occurrences of the letter C in your input file:

grep -Fo 'C' file | wc -l

To do this in a single command, you can use GNU awk with a custom RS:

awk -v RS='C' 'END{print NR-1}' file
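
As a quick sanity check, both commands can be run against a small sample file. This is only a sketch: the file contents and the variable names (sample, grep_count, awk_count) are made up for illustration. One caveat worth knowing: NR-1 assumes there is some text (at least a trailing newline) after the last occurrence, because awk does not emit an empty record when the input ends exactly on the separator.

```shell
#!/usr/bin/env bash
# Hypothetical sample file; it contains four uppercase 'C' characters.
sample=$(mktemp)
printf 'C,aC\nCC\n' > "$sample"

# grep -o prints each match on its own line; wc -l counts those lines.
grep_count=$(grep -Fo 'C' "$sample" | wc -l)

# awk splits the input into records at every 'C', so records = matches + 1
# as long as the file does not end with 'C' itself.
awk_count=$(awk -v RS='C' 'END{print NR-1}' "$sample")

echo "grep: $grep_count  awk: $awk_count"
rm -f "$sample"
```

Both counts should agree on a file like this one, which ends with a newline.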

edited Aug 06 '20 at 07:13

answered Jan 02 '16 at 05:40


I had a similar requirement recently, to filter very large text files (> 1Gb) and AWK was the fastest of all methods. – philbrooksjazz Jan 02 '16 at 08:59

mauro · Accepted Answer · 2016-01-02T10:30:01.090

Posting the results of a few tests here for documentation purposes. I counted the number of dots in a file containing 1,807,076,940 bytes and 100 million lines. Each line contains exactly one dot:

$ time wc -l xnorm.dat # takes 1.047 seconds (this counts newlines)
$ time grep -o '\.' xnorm.dat | wc -l # takes 87.443 seconds
$ time awk -v RS='.' 'END{print NR-1}' xnorm.dat # takes 53.947 seconds
$ time tr -d -C '\.' < xnorm.dat | wc -c # takes 3.732 seconds
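
The tr approach generalizes easily to any single-byte character. A minimal sketch, assuming a POSIX-like tr and wc; the function name count_char is hypothetical and was not part of the tests above. LC_ALL=C makes tr operate on bytes rather than multibyte characters, and characters that are special to tr (such as a backslash) would need to be escaped by the caller.

```shell
#!/usr/bin/env bash
# count_char CHAR FILE: print how many times the single-byte CHAR occurs in FILE.
# (Hypothetical helper name, for illustration only.)
count_char() {
    # -c complements the set and -d deletes: every byte except CHAR is dropped,
    # so wc -c simply counts the bytes that remain.
    LC_ALL=C tr -cd -- "$1" < "$2" | wc -c
}
```

For example, count_char . xnorm.dat would count the dots in the test file, and count_char , file.csv the commas in a CSV file.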

Edit

I wrote a small program (fcc = fast char counter) in C, as per 4ae1e1's suggestion:

$ time fcc -i xnorm.dat -c \. # takes 1.327 seconds
