What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?
-
presumably without doing a system call to use the 'wc' command? – JeeBee Apr 23 '09 at 14:50
-
This is not a homework assignment although I admit it could look like one. – NoahD Apr 23 '09 at 18:41
10 Answers
Here's the Perl code. Counting words can be somewhat subjective, but I'd say a word is any string of characters that isn't whitespace.
open(FILE, "<file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}
print("lines=$lines words=$words chars=$chars\n");

-
-
for word count, you want: scalar(split); That will split on /\s+/ and drop leading empty fields, just like awk does. – glenn jackman Apr 23 '09 at 15:33
-
As a note sort of related to glenn's, you can say "length;" instead of "length $_;" and Perl will default to using $_. However, using the defaults on split() is more beneficial, as it even has a default regex. – Chris Lutz Apr 23 '09 at 15:48
-
@Paul Tomblin: here, are you happy now: perl -ne 'END{print"$. $c $w\n"}$c+=length;$w+=split' – Chas. Owens Apr 23 '09 at 15:55
-
@Brad Gilbert why not go all the way: open my $fh, "<", "file.txt" or die "could not open file: $!"; – Chas. Owens Apr 24 '09 at 19:40
-
@Chas: Reasons not to do so are many, but they include not specifying the encoding. See my answer. – tchrist Feb 28 '11 at 23:03
-
`use warnings` says "Use of implicit split to @_ is deprecated" – Dennis Williamson Jul 09 '12 at 22:05
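The split behaviors discussed in these comments are easy to see side by side. A minimal sketch (not part of the original answer) comparing the two forms on a line with leading whitespace:

```perl
my $line = "  the quick brown fox\n";

# split /\s+/ keeps a leading empty field when the string
# starts with whitespace, inflating the count by one.
my @with_pattern = split /\s+/, $line;   # ("", "the", "quick", "brown", "fox")

# split ' ' (a literal single space) is special-cased: it splits
# on runs of whitespace AND discards leading empty fields, like awk.
my @awk_style = split ' ', $line;        # ("the", "quick", "brown", "fox")

printf "pattern: %d words, awk-style: %d words\n",
    scalar(@with_pattern), scalar(@awk_style);
```

This is why glenn jackman's bare `scalar(split)` (which defaults to `split ' ', $_`) gives the awk-like count.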
A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or even better, \W+) as the delimiter. Consider the string "The quick brown fox" (with additional spaces between the words, if it's not obvious). Using a delimiter of a single whitespace character will give a word count of six, not four. So, try:
open(FILE, "<file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}
print("lines=$lines words=$words chars=$chars\n");
Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.

-
Using \W will split "nit-picking" into two words. I don't know if this is correct behavior or not, but I always thought of hyphenated words as one word rather than two. – Chris Lutz Apr 23 '09 at 15:50
-
It's one of those 'you pays your money, you makes your choice' things. Personally, I usually roll my own regex that fits the definition of 'word' I need at the time. Quite often split can be less than helpful because it is a negative match. A normal regex matches the characters you *do* want, generally a better idea. You could certainly do the same sort of thing using m/.../g and calling it in list context. – Nic Gibson Apr 23 '09 at 17:19
-
This only counts code points, not characters (=graphemes). And it forgets to set the encoding. – tchrist Feb 28 '11 at 23:04
-
@tchrist - yup, I was aware of that but didn't think it worth adding. You're probably right. One the other, good point - I wasn't thinking there. I'm going to leave it as is as it's so long ago it doesn't seem worth changing given other answers – Nic Gibson Mar 01 '11 at 14:34
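The trade-offs debated in these comments can be shown side by side. A small sketch (mine, not from the answer) counting the same line three ways, including the positive-match `m/.../g` approach Nic mentions:

```perl
my $line = "Nit-picking, they say, is fun.";

my @by_space   = split /\s+/, $line;         # punctuation sticks to words
my @by_nonword = split /\W+/, $line;         # "Nit-picking" becomes two words
my @by_match   = $line =~ /\w+(?:-\w+)*/g;   # positive match keeps hyphenated words whole

printf "space: %d, non-word: %d, match: %d\n",
    scalar(@by_space), scalar(@by_nonword), scalar(@by_match);
```

The positive match makes the definition of "word" explicit: here it is word characters, optionally joined by hyphens.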
Here. Try this Unicode-savvy version of the wc program.
It skips non-file arguments (pipes, directories, sockets, etc).
It assumes UTF-8 text.
It counts any Unicode whitespace as a word separator.
It also accepts alternate encodings if there is a `.ENCODING` at the end of the filename, like `foo.cp1252`, `foo.latin1`, `foo.utf16`, etc.
It also works with files that have been compressed in a variety of formats.
It gives counts of Paragraphs, Lines, Words, Graphemes, Characters, and Bytes.
It understands all Unicode linebreak sequences.
It warns about corrupted textfiles with linebreak errors.
Here’s an example of running it:
Paras Lines Words Graphs Chars Bytes File
2 2270 82249 504169 504333 528663 /tmp/ap
1 2404 11163 63164 63164 66336 /tmp/b3
uwc: missing linebreak at end of corrupted textfile /tmp/bad
1* 2* 4 19 19 19 /tmp/bad
1 14 52 273 273 293 /tmp/es
57 383 1369 11997 11997 12001 /tmp/funny
1 657068 3175429 31205970 31209138 32633834 /tmp/lw
1 1 4 27 27 27 /tmp/nf.cp1252
1 1 4 27 27 34 /tmp/nf.euc-jp
1 1 4 27 27 27 /tmp/nf.latin1
1 1 4 27 27 27 /tmp/nf.macroman
1 1 4 27 27 54 /tmp/nf.ucs2
1 1 4 27 27 56 /tmp/nf.utf16
1 1 4 27 27 54 /tmp/nf.utf16be
1 1 4 27 27 54 /tmp/nf.utf16le
1 1 4 27 27 112 /tmp/nf.utf32
1 1 4 27 27 108 /tmp/nf.utf32be
1 1 4 27 27 108 /tmp/nf.utf32le
1 1 4 27 27 39 /tmp/nf.utf7
1 1 4 27 27 31 /tmp/nf.utf8
1 26906 101528 635841 636026 661202 /tmp/o2
131 346 1370 9590 9590 4486 /tmp/perl5122delta.pod.gz
291 814 3941 25318 25318 9878 /tmp/perl51310delta.pod.bz2
1 2551 5345 132655 132655 133178 /tmp/tailsort-pl.utf8
1 89 334 1784 1784 2094 /tmp/til
1 4 18 88 88 106 /tmp/w
276 1736 5773 53782 53782 53804 /tmp/www
Here ya go:
#!/usr/bin/env perl
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################
use 5.10.0;
use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];
use Carp;
$SIG{__WARN__} = sub {
    confess("FATALIZED WARNING: @_") unless $^S;
};
$SIG{__DIE__} = sub {
    confess("UNCAUGHT EXCEPTION: @_") unless $^S;
};

$| = 1;

my $Errors = 0;
my $Headers = 0;

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}
process_input(\&countem);
sub countem {
    my ($_, $file) = @_;
    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount,
        $grafcount, $charcount, $bytecount,
    );
    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+ }x };
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R }x };
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;

        $paracount = eval { @paras = split m{ \R{2,} }x };
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfile $file");
            $linecount .= "*";
            $paracount .= "*";
        }
    }
    $bytecount = tell;
    if (-e $file) {
        if ($bytecount >= 0 && $bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
        $bytecount = -s $file;
    }
    my $mask = "%8s " x 6 . "%s\n";
    printf $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;
    printf $mask => map( { show_undef($_) }
        $paracount, $linecount,
        $wordcount, $grafcount,
        $charcount, $bytecount,
    ), $file;
}
sub show_undef {
    my $value = shift;
    return defined($value)
        ? $value
        : "undef";
}

END {
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}
sub process_input {
    my $function = shift();
    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:
    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }
    } # foreach file
}
sub set_encoding(*$) {
    my ($handle, $path) = @_;
    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }
    yuck("set_encoding: $@");
    return undef;
}
sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z     => "zcat",
        z     => "gzcat",   # for uncompressing
        gz    => "gzcat",
        bz    => "bzcat",
        bz2   => "bzcat",
        bzip  => "bzcat",
        bzip2 => "bzcat",
        lzma  => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        }
    }
    return $path;
}

I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about perl so some of this may be off base, but here are my tweaks of newt's solution.
First, there is a built-in line-count variable, `$.`, so I just used that; it is probably a bit more efficient, too. As written, the character count includes newline characters, which is probably not what you want, so I chomped `$_`. Perl also complained about the way the split() is done (implicit split; see: Why does Perl complain "Use of implicit split to @_ is deprecated"?), so I tweaked that. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.
Here's the code:
open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";
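The special variable `$.` used above holds the line number of the last-read filehandle. A self-contained sketch of the same pattern (using an in-memory filehandle instead of file.txt, and a positive `/\w+/g` match for words, so the details differ from the answer's code):

```perl
my $text = "alpha beta\ngamma\n";
open my $fh, '<', \$text or die "could not open in-memory file: $!";

my ($chars, $words) = (0, 0);
while (<$fh>) {
    chomp;                    # drop the trailing newline before counting
    $chars += length;         # length defaults to $_
    $words += () = /\w+/g;    # count word matches rather than split fields
}
my $lines = $.;               # $. holds the last line number read
close $fh;
print "lines=$lines words=$words chars=$chars\n";
```

Note that `$.` is read before `close`, since closing the handle resets it.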
There is the Perl Power Tools project, whose goal is to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX-compliant.
It gets a little ridiculous when you look at the GNU-compliant implementation of true.

-
-
Schwern: I’ve been reïmplementing quite a bit of PPT for Unicode smartsiness. I’ve lately done `cat -v/od -c`, `expand`, `fmt`, `grep`, `look`, `rev`, `sort`, and `wc`. All are improved over the originals. – tchrist Feb 28 '11 at 23:07
Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The `wc` binary does this.
#!/usr/bin/env perl
use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;
    };
    my ($chars, $words, $lines) = (0, 0, 0);
    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += () = /\s+/g;
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;
        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    $lines-- if $new_line;
    print "\t$lines\t$words\t$chars\t$file\n";
}

-
I'm not sure this gives you any benefit. Under the hood, as it were, perl's <> operator is using buffered IO. All you have done here is rewrite something built-in with something that has to be interpreted. – Nic Gibson Apr 23 '09 at 17:20
-
True. At least with my installation of 5.8.8, Perl buffers 4096 bytes at a time, and there's no performance benefit to doing this manually -- as you suspected, if anything, it's actually worse. I like reminding people to think low-level though :) – ephemient Apr 23 '09 at 17:35
-
And what do you do about UTF-8 chars that are split across block boundaries, eh? – tchrist Feb 28 '11 at 18:49
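One way to address tchrist's objection is incremental decoding: keep any undecoded trailing bytes and prepend them to the next block. A hedged sketch using the core Encode module (the two-element `@blocks` array stands in for successive sysread buffers; it is illustrative, not part of the answer above):

```perl
use Encode qw(decode FB_QUIET);

# "héllo" in UTF-8, with the two-byte é split across two reads.
my @blocks = ("h\xC3", "\xA9llo");

my ($decoder_buf, $decoded) = ('', '');
for my $block (@blocks) {
    $decoder_buf .= $block;
    # With FB_QUIET, decode converts as much as it can and removes
    # the converted bytes from $decoder_buf, leaving any incomplete
    # trailing sequence behind for the next iteration.
    $decoded .= decode('UTF-8', $decoder_buf, FB_QUIET);
}
printf "%d characters\n", length $decoded;   # 5, not 6
```

Counting `length $decoded` per block then gives a character count that is immune to multibyte sequences straddling a block boundary.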
To be able to count CHARS and not bytes, consider this:
(Try it with Chinese or Cyrillic letters and a file saved in UTF-8.)
use utf8;
my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
|| die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;
print length $txt,$/;
use bytes;
print length $txt,$/;

- 1,383
- 10
- 22
-
Perl defaults to using the system locale. If your system is modern, the system locale will be an UTF-8 encoding, and thus Perl IO will be UTF-8 by default. If not, you probably should be using the system locale and not forcing UTF-8 mode... – ephemient Apr 23 '09 at 21:00
-
Wrong, ephemient. Perl defaults to the system locale, but prints characters 128-255 as "?" for backwards compatibility. To print proper UTF-8, one should say binmode($fh, ":utf8"); before using the filehandle. In this case, "use utf8;" is useless - it tells Perl that the source code can be in UTF-8, which is unnecessary unless you have variables names like $áccent or $ümlats. – Chris Lutz Apr 24 '09 at 09:47
-
@Chris Both my Perl 5.8 and 5.10 are documented as having `-C SDL` as the default, and `perl -e 'print "\xe2\x81\x89\n"'` produces "⁉" as expected -- not "???" as you seem to expect. – ephemient Apr 24 '09 at 15:24
-
I think those three hex values are combining into one UTF-8 character. And UTF-8 characters will print in Perl. Just not the ones from 128-255. Trying any one of those three hex codes individually on my machine gives me "?", whereas prefixing it with binmode(STDOUT, ":utf8"); gives me "â" for \xe2 and non printing characters for the other two. And as far as I can tell, I have no default setting of "-C" anything. – Chris Lutz Apr 24 '09 at 19:17
-
`echo $'\xe2\x80\x99' | perl -ne'print length,$/'` outputs 4 while `echo $'\xe2\x80\x99' | perl -CSDL -ne'print length,$/'` outputs 2, so I must be misremembering and Chris is correct. – ephemient Apr 24 '09 at 20:47
-
@Berov: You’ve got the right idea!! But see my solution for a rather more elaborate version. – tchrist Feb 28 '11 at 23:08
This may be helpful to Perl beginners. I tried to simulate MS Word's counting functionality, and added one more feature that wc on Linux does not show.
- number of lines
- number of words
- number of characters with spaces
- number of characters without spaces (wc will not give this in its output, but Microsoft Word shows it)
Here is the URL: Counting words, characters and lines in a file
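Since the post links out for its code, here is a minimal sketch of the one count wc lacks, "characters without spaces" (my own illustration, not the linked article's code; the sample `$text` is made up):

```perl
my $text = "Count these words\nand these too\n";

my ($lines, $words, $chars_with_spaces, $chars_no_spaces) = (0, 0, 0, 0);
for my $line (split /^/, $text) {   # split /^/ keeps one line per element
    $lines++;
    chomp $line;
    $chars_with_spaces += length $line;
    (my $squeezed = $line) =~ s/\s+//g;     # strip all whitespace, MS-Word style
    $chars_no_spaces += length $squeezed;
    $words += () = $line =~ /\S+/g;
}
printf "lines=%d words=%d chars=%d chars_no_spaces=%d\n",
    $lines, $words, $chars_with_spaces, $chars_no_spaces;
```

The copy-then-substitute idiom `(my $squeezed = $line) =~ s/\s+//g;` leaves the original line intact so both counts can be taken from the same input.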
