-1

So let's say I had the string:

 my $str = "Hello how are you today. Oh thats good I'm glad you are happy. Thats wonderful; thats fantastic.";

I want to create a hash table where each key is a unique word and the value is the number of times it appears in the string, and I want this to be an automated process. For example:

my %words = (
  "Hello" => 1,
  "are" => 2,
  "thats" => 2,
  "Thats" => 1,
);

I am honestly brand new to Perl and have no clue how to do this or how to handle the punctuation.

UPDATE:

Also, is it possible to use

   split('.!?;',$mystring)   

Not with this syntax, but basically split at a . or ! or ? etc., and also at ' ' (whitespace).

SystemFun
  • How do you *want* to handle punctuation is the question. Is `I'm` a duplicate of `I am`, or should it only be a duplicate of itself? Is `ultra-complex` a duplicate of `ultracomplex` or not? – TLP Feb 19 '13 at 22:13
  • Anything that is different in any way should be treated as different. I meant punctuation like .'s !'s ;'s and ?'s. Sorry. – SystemFun Feb 19 '13 at 22:14
  • You'll find some hints [here](http://stackoverflow.com/questions/782087/how-do-i-count-the-characters-words-and-lines-in-a-file-using-perl). – Craig Treptow Feb 19 '13 at 22:14
  • also somewhat related: http://stackoverflow.com/questions/8252547/how-to-split-a-string-with-multiple-patterns-in-perl – amphibient Feb 19 '13 at 22:46
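
Regarding the UPDATE in the question: `split` takes a regex as its first argument, so a character class can list every delimiter at once. The following is only a minimal sketch; the sample sentence and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, not the string from the question.
my $mystring = "Hello how are you today. Oh thats good! Is it? Yes; it is.";

# The character class lists the delimiters (. ! ? ; and whitespace).
# The + treats a run of them, such as ". ", as a single separator.
my @words = split /[.!?;\s]+/, $mystring;

print join("|", @words), "\n";
```

Trailing empty fields are dropped by `split`, so the final period does not produce an extra empty element.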

4 Answers

4

One simple way to do it is to split the string on any character that is not a valid word-character in your view. Note that this is by no means an exhaustive solution as it is. I have simply taken a limited set of characters.

You can add valid word-characters inside the brackets [ ... ] as you discover edge cases. You might also search http://search.cpan.org for modules designed for this purpose.

The regex [^ ... ] matches any character that is not listed inside the brackets. \pL matches any Unicode letter, which is a larger set than a-z, and the other characters are literal. The dash - is escaped because an unescaped dash can denote a range inside a character class.

use strict;
use warnings;
use Data::Dumper;

my $str = "Hello how are you today. Oh thats good I'm glad you are happy.
           Thats wonderful; thats fantastic.";
my %hash;
$hash{$_}++                      # increase count for each field
    for                          # in the loop
    split /[^\pL'\-!?]+/, $str;  # over the list from splitting the string 
print Dumper \%hash;

Output:

$VAR1 = {
          'wonderful' => 1,
          'glad' => 1,
          'I\'m' => 1,
          'you' => 2,
          'how' => 1,
          'are' => 2,
          'fantastic' => 1,
          'good' => 1,
          'today' => 1,
          'Hello' => 1,
          'happy' => 1,
          'Oh' => 1,
          'Thats' => 1,
          'thats' => 2
        };
TLP
  • Ok thanks. how do I account for the fact that Thats is NOT supposed to be thats. – SystemFun Feb 19 '13 at 22:27
  • @Vlad You want to distinguish between upper and lower case? Then change `lc($_)` to just `$_`. I'll remove it. – TLP Feb 19 '13 at 22:28
  • ok thanks. I'll have to work on learning that syntax. Is there an error in yours? Why is everything red i.e. a string? – SystemFun Feb 19 '13 at 22:33
  • @Vlad Red? Are you talking about stackoverflow's code highlighting? That's just the single quote making it think it's a quoted string. – TLP Feb 19 '13 at 22:34
  • I figured that out haha. Thanks for your help! – SystemFun Feb 19 '13 at 22:35
  • so if I wanted it to ignore a ; then I would just add it in between the ! and ? – SystemFun Feb 19 '13 at 22:37
  • Well, that would be a conflict, since `a` is a part of the character class `\pL`. – TLP Feb 19 '13 at 22:39
  • I meant ignore a semicolon as in the function splits on semicolons, not the letter a, sorry that's my fault. – SystemFun Feb 19 '13 at 22:40
  • @Vlad Yes, you can add `;` and any other punctuation you want. – TLP Feb 19 '13 at 22:41
  • ok I think I can handle it. Thanks for your help. Haven't learned a darned thing about regex – SystemFun Feb 19 '13 at 22:45
  • @Vlad Practice makes perfect. – TLP Feb 19 '13 at 22:50
  • What do the / and +/ surrounding the brackets indicate? – SystemFun Feb 19 '13 at 23:08
  • @Vlad The slashes are the regex delimiters of the `m//` operator, the plus sign is the quantifier that means `1 or more of the previous`. Read more in the perldoc documentation, for example perldoc perlop, or perlretut. – TLP Feb 19 '13 at 23:12
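
A note on the semicolon discussion above, as a small sketch rather than a definitive rule: with a negated class, the characters listed inside the brackets are the ones kept as part of a word, and anything not listed acts as a delimiter. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $str = "good; bad? ugly!";

# ';' is NOT listed in the class, so it acts as a delimiter.
my @split_on_semicolon = split /[^\pL'\-!?]+/, $str;

# ';' IS listed in the class, so it stays attached to the word.
my @keep_semicolon = split /[^\pL'\-!?;]+/, $str;

print join("|", @split_on_semicolon), "\n";
print join("|", @keep_semicolon), "\n";
```

So with the class used in the answer, semicolons are already split on. Removing `!` and `?` from the class would make those delimiters as well.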
1

This will use whitespace to separate words.

#!/usr/bin/env perl
use strict;
use warnings;

my $str = "Hello how are you today."
        . " Oh thats good I'm glad you are happy."
        . " Thats wonderful. thats fantastic.";

# Use whitespace to split the string into single "words".
my @words = split /\s+/, $str;

# Store each word in the hash and count its occurrence.
my %hash;
for my $word ( @words ) {
    $hash{ $word }++;
}

# Show each word and its count. Using printf to align output.
for my $key ( sort keys %hash ) {
    printf "%-10s => %d\n", $key, $hash{ $key };
}

You will need some fine-tuning to get "real" words, because punctuation stays attached to the words in the output:

Hello      => 1
I'm        => 1
Oh         => 1
Thats      => 1
are        => 2
fantastic. => 1
glad       => 1
good       => 1
happy.     => 1
how        => 1
thats      => 2
today.     => 1
wonderful. => 1
you        => 2
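
One possible form of that fine-tuning, offered only as a sketch: keep the whitespace split, then strip punctuation from the end of each token. The set of characters to strip (`.!?;,:`) and the name `%count` are assumptions, not part of the answer above.

```perl
use strict;
use warnings;

my $str = "Hello how are you today."
        . " Oh thats good I'm glad you are happy."
        . " Thats wonderful. thats fantastic.";

my %count;
for my $word ( split /\s+/, $str ) {
    # Remove punctuation at the end of the token only, so the
    # apostrophe inside "I'm" is left alone.
    $word =~ s/[.!?;,:]+$//;
    next if $word eq '';    # skip tokens that were pure punctuation
    $count{$word}++;
}

printf "%-10s => %d\n", $_, $count{$_} for sort keys %count;
```

This keeps `Thats` and `thats` as separate keys, since nothing is lowercased.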
Perleone
  • he needs more delimiters than just space. see what i did: `my @strAry = split /[:,\.\s\/]+/, $str;` – amphibient Feb 19 '13 at 22:32
  • That's what the "will need some fine-tuning" is for. Waiting for homeworkoverflow.com so I can post it there. ;-) – Perleone Feb 19 '13 at 22:34
  • @Perleone so for Perl, I can put a variable in the {} and it will just add it to the hash? How do I access a variable then? And if that variable is already in the hash what happens? – SystemFun Feb 19 '13 at 22:39
  • @Vlad Yes. `$hash{beer} = 5;` adds the key `beer` with the value `5` to `%hash`. You access it the same way: `print $hash{beer};` will output `5`. If the key is already present, the value will be overwritten: `$hash{beer} = 3;`. – Perleone Feb 19 '13 at 22:43
  • oh ok. so the ++ will just add one to the present value? – SystemFun Feb 19 '13 at 22:44
  • @Vlad Exactly. And since the first time there is no value present, which equals 0, it will add 1 to 0 and result in 1. – Perleone Feb 19 '13 at 22:48
  • Ok thanks I think I understand now. Just need to learn the regex stuff. – SystemFun Feb 19 '13 at 22:48
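
The behaviour described in these comments can be checked with a few lines. This is a minimal sketch; the key `beer` comes from the comment above and `$before` is a name chosen for illustration.

```perl
use strict;
use warnings;

my %hash;

# The key does not exist until something is stored under it.
my $before = exists $hash{beer} ? "present" : "absent";

$hash{beer}++;    # the missing value counts as 0, so this stores 1
$hash{beer}++;    # now 2
my $after_increment = $hash{beer};

$hash{beer} = 5;  # plain assignment overwrites the previous value

print "before: $before, after ++ twice: $after_increment, after assignment: $hash{beer}\n";
```

Incrementing an undefined value with `++` does not trigger a warning, which is why the counting loop needs no explicit initialisation.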
1

Try this:

use strict;
use warnings;

my $str = "Hello, how are you today. Oh thats good I'm glad you are happy. 
           Thats wonderful.";
my @strAry = split /[:,\.\s\/]+/, $str;
my %strHash;

foreach my $word(@strAry) 
{
    print "\nFOUND WORD: ".$word;
    my $exstCnt = $strHash{$word};

    if(defined($exstCnt)) 
    {
        $exstCnt++;
    } 
    else 
    {
        $exstCnt = 1;
    }

    $strHash{$word} = $exstCnt;
}

print "\n\nNOW REPORTING UNIQUE WORDS:\n";

foreach my $unqWord(sort(keys(%strHash))) 
{
    my $cnt = $strHash{$unqWord};
    print "\n".$unqWord." - ".$cnt." instances";
}
amphibient
  • Why the double spaced formatting? You don't have to use the concatenation operator to interpolate variables, just enter them in the string `"Found word $word\n"`. You don't need to go over a transition variable to increment the counter, just increment it directly. – TLP Feb 19 '13 at 22:38
  • @foampile so for Perl, I can put a variable in the {} and it will just add it to the hash? How do I access a variable then? And if that variable is already in the hash what happens? – SystemFun Feb 19 '13 at 22:39
  • yes, @Vlad. `$strHash{'Vlad'} = 1;` adds key 'Vlad' to the hash and assigns value 1 to it – amphibient Feb 19 '13 at 22:44
  • @TLP, good point. but i think this more verbose way is more intelligible for our beginner Vlad to follow what is going on – amphibient Feb 19 '13 at 22:45
  • @foampile your explanation is very thorough. I have programmed before, just not in Perl, so it was easy to follow your logic. Thanks :) – SystemFun Feb 19 '13 at 22:47
  • @foampile Actually, I think it is more confusing. `$strHash{$word}++` certainly is more legible than what you wrote. And `"\n".$unqWord." - ".$cnt." instances"` is a mess compared to `"\n$unqWord - $cnt instances"` – TLP Feb 19 '13 at 22:50
  • thanks, @Vlad. see, the thing with Perl is that you can do anything in at least like 7 different ways. that has its pros and cons as some ways are better than others. you can be more or less verbose. personally, i like to not compress a lot of functionality on one line and be more explicit so my code is easier to understand. then again, Perl is not my core skill (Java is). most Perl programmers write code that is less verbose but also harder to consume visually than i do. just my $0.02 – amphibient Feb 19 '13 at 22:52
  • @TLP, to each their own. i prefer that each line of code be simpler and not have multiple operations compressed into it. – amphibient Feb 19 '13 at 22:53
  • @foampile No, I'm sorry, but you are wrong. To even suggest that `my $exstCnt = $strHash{$word}; if(defined($exstCnt)) { $exstCnt++; } else { $exstCnt = 1; } $strHash{$word} = $exstCnt;` is less complicated than `$strHash{$word}++` is just ludicrous. Surely you see that? Also, what is it that you find "simpler" about `"\nFOUND WORD: ".$word` as compared to `"FOUND WORD: $word\n"`? – TLP Feb 19 '13 at 23:09
  • because when you say `$strHash{$word}++`, the beginner needs to know (i.e. it is not EXPLICIT) that if $word does not exist in the hash, it will be entered automatically and then the value will be incremented. the code i wrote is a decompressed, digestible, school example that breaks the function into granular units, each of which does one thing only. – amphibient Feb 19 '13 at 23:16
  • @TLP, production-grade coding style is not necessarily something you want to present to a rookie. i am talking really more about teaching methodology than what productionalized, efficient code needs to look like – amphibient Feb 19 '13 at 23:19
  • @TLP, { on the next line is a personal preference. i write Java like that as well as Perl. certainly, you are being nitpicky changing that – amphibient Feb 19 '13 at 23:20
  • @foampile I prefer to explain my shortcuts rather than not showing them at all. These particular operations are rather self-explanatory though, IMO. It is of course a good thing to be extra clear sometimes, but you have to stay out of your comfort zone in order to learn anything, I think. I did not edit your question for the curly brackets, I edited it for the double spacing. – TLP Feb 19 '13 at 23:33
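
For comparison, here is a sketch of the compact form discussed in these comments, using the same delimiters and variable names as the answer above. It is meant to show the two suggestions (direct `++` and string interpolation) side by side with the verbose version, not to settle which style is better.

```perl
use strict;
use warnings;

my $str = "Hello, how are you today. Oh thats good I'm glad you are happy. Thats wonderful.";

my %strHash;

# Increment the counter directly; a missing key starts from 0.
$strHash{$_}++ for split /[:,.\s\/]+/, $str;

# Variables interpolate inside double quotes, so no concatenation is needed.
for my $unqWord ( sort keys %strHash ) {
    print "$unqWord - $strHash{$unqWord} instances\n";
}
```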
0
 use YAML qw(Dump);
 use 5.010;

 my $str = "Hello how are you today. Oh thats good I'm glad you are happy. Thats wonderful; thats fantastic.";
 my @match_words = $str =~ /(\w+)/g;
 my $word_hash = {};
 foreach my $word (sort @match_words) {
     $word_hash->{$word}++;
 }
 say Dump($word_hash);
 # -------output----------
 Hello: 1
 I: 1
 Oh: 1
 Thats: 1
 are: 2
 fantastic: 1
 glad: 1
 good: 1
 happy: 1
 how: 1
 m: 1
 thats: 2
 today: 1
 wonderful: 1
 you: 2
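
 Note that `\w+` splits `I'm` into `I` and `m` in the output above. One way to keep contractions together, sketched here under the assumption that an apostrophe should always count as part of a word, is to add it to the character class. The name `%word_count` is chosen for illustration.

```perl
use strict;
use warnings;

my $str = "Hello how are you today. Oh thats good I'm glad you are happy. Thats wonderful; thats fantastic.";

my %word_count;

# [\w']+ matches runs of word characters and apostrophes, so "I'm"
# is captured whole instead of as "I" and "m".
$word_count{$_}++ for $str =~ /([\w']+)/g;

print "$_: $word_count{$_}\n" for sort keys %word_count;
```

 A side effect of this choice is that quote marks around a word, as in `'hello'`, would also be kept as part of the key.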