21

What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?

brian d foy
NoahD

10 Answers

25

Here's the Perl code. Counting words can be somewhat subjective, but I just say a word is any string of characters that isn't whitespace.

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");
bmdhacks
7

A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or, even better, \W+) as the delimiter. Consider the string "The  quick  brown fox" (note the extra spaces). Using a single whitespace character as the delimiter gives a word count of six, not four. So, try:

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.
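
For instance, here is a quick sketch (the sample string is made up purely for illustration) showing how the two delimiters treat a punctuation-only token:

my $text = "Stop -- in the name of love!";

my $space_count   = () = split /\s+/, $text;   # 7 fields: "--" and "love!" count as words
my $nonword_count = () = split /\W+/, $text;   # 6 fields: the "--" token disappears

print "split /\\s+/ gives $space_count, split /\\W+/ gives $nonword_count\n";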

Nic Gibson
  • Using \W will split "nit-picking" into two words. I don't know if this is correct behavior or not, but I always thought of hyphenated words as one word rather than two. – Chris Lutz Apr 23 '09 at 15:50
  • It's one of those 'you pays your money, you makes your choice' things. Personally, I usually roll my own regex that fits the definition of 'word' I need at the time. Quite often split can be less than helpful because it is a negative match. A normal regex matches the characters you *do* want, generally a better idea. You could certainly do the same sort of thing using m/.../g and calling it in list context. – Nic Gibson Apr 23 '09 at 17:19
  • 1
    This only counts code points, not characters (=graphemes). And it forgets to set the encoding. – tchrist Feb 28 '11 at 23:04
  • @tchrist - yup, I was aware of that but didn't think it worth adding. You're probably right. On the other, good point - I wasn't thinking there. I'm going to leave it as is as it's so long ago it doesn't seem worth changing given other answers – Nic Gibson Mar 01 '11 at 14:34
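
To illustrate the list-context match mentioned in the comment above, here is a minimal sketch (the filename and the character class are placeholders; adjust the class to whatever definition of "word" you need) that matches the characters you do want instead of splitting on the ones you don't:

open(FILE, "<", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0, 0, 0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += () = /[\w'-]+/g;   # count matches of the "word" pattern directly
}
close(FILE);

print("lines=$lines words=$words chars=$chars\n");
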
4

The Word Count tool counts characters, words, and lines in text files.

TStamper
3

Here. Try this Unicode-savvy version of the wc program.

  • It skips non-file arguments (pipes, directories, sockets, etc).

  • It assumes UTF-8 text.

  • It counts any Unicode whitespace as a word separator.

  • It also accepts alternate encodings if there is a .ENCODING at the end of the filename, like foo.cp1252, foo.latin1, foo.utf16, etc.

  • It also works with files that have been compressed in a variety of formats.

  • It gives counts of Paragraphs, Lines, Words, Graphemes, Characters, and Bytes.

  • It understands all Unicode linebreak sequences.

  • It warns about corrupted textfiles with linebreak errors.

Here’s an example of running it:

   Paras    Lines    Words   Graphs    Chars    Bytes File
       2     2270    82249   504169   504333   528663 /tmp/ap
       1     2404    11163    63164    63164    66336 /tmp/b3
    uwc: missing linebreak at end of corrupted textfile /tmp/bad
      1*       2*        4       19       19       19 /tmp/bad
       1       14       52      273      273      293 /tmp/es
      57      383     1369    11997    11997    12001 /tmp/funny
       1   657068  3175429 31205970 31209138 32633834 /tmp/lw
       1        1        4       27       27       27 /tmp/nf.cp1252
       1        1        4       27       27       34 /tmp/nf.euc-jp
       1        1        4       27       27       27 /tmp/nf.latin1
       1        1        4       27       27       27 /tmp/nf.macroman
       1        1        4       27       27       54 /tmp/nf.ucs2
       1        1        4       27       27       56 /tmp/nf.utf16
       1        1        4       27       27       54 /tmp/nf.utf16be
       1        1        4       27       27       54 /tmp/nf.utf16le
       1        1        4       27       27      112 /tmp/nf.utf32
       1        1        4       27       27      108 /tmp/nf.utf32be
       1        1        4       27       27      108 /tmp/nf.utf32le
       1        1        4       27       27       39 /tmp/nf.utf7
       1        1        4       27       27       31 /tmp/nf.utf8
       1    26906   101528   635841   636026   661202 /tmp/o2
     131      346     1370     9590     9590     4486 /tmp/perl5122delta.pod.gz
     291      814     3941    25318    25318     9878 /tmp/perl51310delta.pod.bz2
       1     2551     5345   132655   132655   133178 /tmp/tailsort-pl.utf8
       1       89      334     1784     1784     2094 /tmp/til
       1        4       18       88       88      106 /tmp/w
     276     1736     5773    53782    53782    53804 /tmp/www

Here ya go:

#!/usr/bin/env perl 
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################

use 5.10.0;

use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];

use Carp;

$SIG{__WARN__}  = sub {
    confess("FATALIZED WARNING: @_")  unless $^S;
};

$SIG{__DIE__}  = sub {
    confess("UNCAUGHT EXCEPTION: @_")  unless $^S;
};

$| = 1;

my $Errors = 0;
my $Headers = 0;

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

process_input(\&countem);

sub countem { 
    my ($_, $file) = @_;

    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount, 
        $grafcount, $charcount, $bytecount,
    );

    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+  }x }; 
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R     }x }; 
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;

        $paracount = eval { @paras = split m{ \R{2,} }x }; 
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfiile $file");
            $linecount .= "*";
            $paracount .= "*";
        } 
    }

    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    } 
    my $mask = "%8s " x 6 . "%s\n";
    printf  $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;

    printf $mask => map( { show_undef($_) } 
                                $paracount, $linecount, 
                                $wordcount, $grafcount, 
                                $charcount, $bytecount,
                       ), $file;
} 

sub show_undef {
    my $value = shift;
    return defined($value)
             ? $value
             : "undef";
} 

END { 
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}

sub process_input {

    my $function = shift();

    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle

        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);

        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        } 
    } 

    return $path;

}

tchrist
2

I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about Perl, so some of this may be off base, but here are my tweaks of newt's solution.

First, there is a built-in line count variable anyway, so I just used that; it's probably a bit more efficient. As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() is done (implicit split; see: Why does Perl complain "Use of implicit split to @_ is deprecated"?), so I tweaked that. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.

Here's the code:

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;    # $. holds the line number of the last record read
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";
elef
2

There is the Perl Power Tools project, whose goal is to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX-compliant.

It gets a little ridiculous when you look at the GNU-compliant implementation of true.

Schwern
  • Most of the fancy 'true' implementation is POD. Still ridiculous. – Chris Lutz Apr 24 '09 at 09:50
  • Schwern: I’ve been reïmplementing quite a bit of PPT for Unicode smartsiness. I’ve lately done `cat -v/od -c`, `expand`, `fmt`, `grep`, `look`, `rev`, `sort`, and `wc`. All are improved over the originals. – tchrist Feb 28 '11 at 23:07
1

Non-serious answer:

system("wc foo");
Paul Tomblin
1

Reading the file in fixed-size chunks may be more efficient than reading line-by-line. The wc binary does this.

#!/usr/bin/env perl

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += () = /\s+/g;   # count all whitespace runs in this block
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;

        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    $lines-- if $new_line;

    print "\t$lines\t$words\t$chars\t$file\n";
}
ephemient
  • 2
    I'm not sure this gives you any benefit. Under the hood, as it were, perl's <> operator is using buffered IO. All you have done here is rewrite something built-in with something that has to be interpreted. – Nic Gibson Apr 23 '09 at 17:20
  • True. At least with my installation of 5.8.8, Perl buffers 4096 bytes at a time, and there's no performance benefit to doing this manually -- as you suspected, if anything, it's actually worse. I like reminding people to think low-level though :) – ephemient Apr 23 '09 at 17:35
  • 1
    And what do you do about UTF-8 chars that are split across block boundaries, eh? – tchrist Feb 28 '11 at 18:49
1

To be able to count characters (not bytes), consider this; try it with Chinese or Cyrillic letters and a file saved as UTF-8:

use utf8;

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;

print length $txt,$/;    # character count
use bytes;
print length $txt,$/;    # byte count (under the bytes pragma)
Беров
  • Perl defaults to using the system locale. If your system is modern, the system locale will be an UTF-8 encoding, and thus Perl IO will be UTF-8 by default. If not, you probably should be using the system locale and not forcing UTF-8 mode... – ephemient Apr 23 '09 at 21:00
  • Wrong, ephemient. Perl defaults to the system locale, but prints characters 128-255 as "?" for backwards compatibility. To print proper UTF-8, one should say binmode($fh, ":utf8"); before using the filehandle. In this case, "use utf8;" is useless - it tells Perl that the source code can be in UTF-8, which is unnecessary unless you have variable names like $áccent or $ümlats. – Chris Lutz Apr 24 '09 at 09:47
  • @Chris Both my Perl 5.8 and 5.10 are documented as having `-C SDL` as the default, and `perl -e 'print "\xe2\x81\x89\n"'` produces "⁉" as expected -- not "???" as you seem to expect. – ephemient Apr 24 '09 at 15:24
  • I think those three hex values are combining into one UTF-8 character. And UTF-8 characters will print in Perl. Just not the ones from 128-255. Trying any one of those three hex codes individually on my machine gives me "?", whereas prefixing it with binmode(STDOUT, ":utf8"); gives me "â" for \xe2 and non printing characters for the other two. And as far as I can tell, I have no default setting of "-C" anything. – Chris Lutz Apr 24 '09 at 19:17
  • `echo $'\xe2\x80\x99' | perl -ne'print length,$/'` outputs 4 while `echo $'\xe2\x80\x99' | perl -CSDL -ne'print length,$/'` outputs 2, so I must be misremembering and Chris is correct. – ephemient Apr 24 '09 at 20:47
  • @Berov: You’ve got the right idea!! But see my solution for a rather more elaborate version. – tchrist Feb 28 '11 at 23:08
0

This may be helpful to Perl beginners. I tried to simulate MS Word's counting functionality and added one more feature, which wc on Linux does not show:

  • number of lines
  • number of words
  • number of characters with spaces
  • number of characters without spaces (wc will not give this in its output, but Microsoft Word shows it)

Here is the URL: Counting words, characters and lines in a file
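
Since the link may not survive, here is a minimal sketch of those four counts (this is not the linked article's code; it uses the simple non-whitespace definition of a word seen elsewhere on this page):

#!/usr/bin/env perl
use strict;
use warnings;

my ($lines, $words, $chars_with, $chars_without) = (0, 0, 0, 0);

while (my $line = <>) {
    chomp $line;                            # drop the newline so it is not counted
    $lines++;
    $words      += () = $line =~ /\S+/g;    # runs of non-whitespace
    $chars_with += length $line;            # characters, spaces included

    (my $squeezed = $line) =~ s/\s+//g;     # strip all whitespace
    $chars_without += length $squeezed;     # characters, spaces excluded
}

print "lines=$lines words=$words ",
      "chars with spaces=$chars_with chars without spaces=$chars_without\n";

Run it as, say, perl count.pl file.txt; the "without spaces" figure is the one wc does not report.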

Jassi