I am trying to write a Perl program that returns the frequency of every word in a file and the length of each word (not the sum of all characters!), in order to produce a Zipf curve from a Spanish text (it's not a big deal if you don't know what a Zipf curve is). My problem is this: I can do the first part and get the frequency of every word, but I don't know how to get the length of each word! :( I know the statement $word_length = length($words), but after trying to change the code I really don't know where to put it or how to get the length of each word.

This is what my code looks like so far:

#!/usr/bin/perl
use strict;
use warnings;

my %count_of;
while (my $line = <>) { #read from file or STDIN
  foreach my $word (split /\s+/, $line){ # /g and /i have no effect in split
     $count_of{$word}++;
  }
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
  print "$word: $count_of{$word}\n";
}
__END__

I hope somebody has some suggestions!

BenMorel
El_Patrón
  • You may wish to check this question: http://stackoverflow.com/questions/6170985/counting-individual-words-in-a-text-file When you do a split like yours, you will end up with `word`, `Word` and `word,` all being treated like different words, which may not be what you want. – TLP May 31 '11 at 17:22
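
To illustrate the point in the comment above, here is a minimal sketch of one way to fold `word`, `Word` and `word,` into a single key before counting. The `normalize_word` helper and the sample string are hypothetical, not part of the original code. For accented Spanish text this assumes the input has already been decoded to characters; on raw UTF-8 bytes `\W` could strip parts of accented letters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: lowercase a token and trim punctuation from both ends,
# so that "Word", "word" and "word," all map to the same key.
sub normalize_word {
    my ($token) = @_;
    $token = lc $token;
    $token =~ s/^\W+//;    # drop leading non-word characters
    $token =~ s/\W+$//;    # drop trailing non-word characters
    return $token;
}

my %count_of;
for my $token (split ' ', 'word Word word, other') {
    my $word = normalize_word($token);
    next if $word eq '';   # the token consisted of punctuation only
    $count_of{$word}++;
}

print "$_: $count_of{$_}\n" for sort keys %count_of;
```

With the sample string, all three spellings end up under the key `word`.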

3 Answers


You can use a hash of hashes if you want to store the length of each word alongside its count.

my %count_of;    # word => { word_count => ..., word_length => ... }
while (my $line = <>) {
    foreach my $word (split /\s+/, $line) {
        $count_of{$word}{word_count}++;
        $count_of{$word}{word_length} = length($word);
    }
}

print "All words and their counts and length: \n";
for my $word (sort keys %count_of) {
    print "$word: $count_of{$word}{word_count} ";
    print "Length of the word: $count_of{$word}{word_length}\n";
}
Shalini

This will print the length right next to the count:

  print "$word: $count_of{$word} ", length($word), "\n";
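
For context, here is a minimal sketch of how that line might fit into a complete script. It is not the answerer's code: the `count_words` helper and the sample sentence are made up for illustration. It also assumes the input is UTF-8 encoded and decodes each line, because on undecoded bytes `length` counts bytes rather than characters, which matters for accented Spanish letters such as ñ.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: count whitespace-separated words read from a filehandle.
# Each line is assumed to hold UTF-8 bytes and is decoded to characters first.
sub count_words {
    my ($fh) = @_;
    my %count_of;
    while (my $line = <$fh>) {
        $line = decode('UTF-8', $line);
        $count_of{$_}++ for split ' ', $line;
    }
    return \%count_of;
}

# Demo on an in-memory sample (made-up data) so the sketch runs on its own.
my $sample = encode('UTF-8', "el ni\x{f1}o y el perro\n");
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $count_of = count_words($fh);
close $fh;

# Print word, count and length, one tab-separated row per word.
binmode STDOUT, ':encoding(UTF-8)';
for my $word (sort keys %$count_of) {
    print join("\t", $word, $count_of->{$word}, length($word)), "\n";
}
```

To run it on real input, pass `\*STDIN` or a filehandle opened on your file to `count_words` instead of the in-memory sample.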
toolic
  • Oh, thanks for the fast answer! It works fine. I wrote it like this: `print $word, "\t", $count_of{$word}, "\t", length($word), "\n";` – El_Patrón May 31 '11 at 17:36

Just for your information, another possibility besides

length($word)

might be:

$word =~ s/(\w)/$1/g

It is not as clear a solution as toolic's, but it can give you another view of this issue (TIMTOWTDI :))

A little explanation:

\w with the g modifier matches every word character (letter, digit or underscore) in your $word

$1 puts each matched character back, so s/// leaves the original $word unchanged

s///g returns the number of substitutions made, i.e. the number of characters matched by \w in $word
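
As a rough illustration of the explanation above, the sketch below compares `length` with the return value of the substitution. The `word_char_count` helper and the sample words are hypothetical. The two numbers agree only when the word consists entirely of `\w` characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the \w characters in a word via s///g.
# It works on a copy, and $1 restores every matched character anyway.
sub word_char_count {
    my ($word) = @_;
    my $n = ($word =~ s/(\w)/$1/g);   # number of substitutions made
    return $n || 0;                   # s/// returns a false value on no match
}

for my $word ('word', 'word,', 'e-mail') {
    printf "%-8s length=%d  word chars=%d\n",
        $word, length($word), word_char_count($word);
}
```

For `word,` the length is 5 but the substitution count is 4, because the comma is not matched by `\w`.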

czubatka
  • You meant `$count = $word =~ s/(\w)//g;` will get the number of letters. ;) – TLP May 31 '11 at 17:18
  • @TLP: Check this: `my $word = "word"; print $word =~ s/(\w)/$1/g;` The output is: `7` Without **$1** you will overwrite your **$word** with number of counted letters. – czubatka May 31 '11 at 19:33
  • Too fast - the output is **4** :) – czubatka May 31 '11 at 19:40
  • Yes, I know. ;) Oh, I see, I forgot `$1` in my comment, my bad. I meant that if you put `$count` before, you will store the number returned from the `s///` in it. So: `$count = $word =~ s/(\w)/$1/g` – TLP May 31 '11 at 19:48