I've got some code that a friend of mine helped create:

 1  use LWP::Simple;
 2  use HTML::TreeBuilder;
 3  use Data::Dumper;
 4   
 5  my $tree = url_to_tree( 'http://www.registrar.ucla.edu/schedule/schedulehome.aspx' );
 6   
 7  my @selects  = $tree->look_down( _tag => 'select' );
 8  my @quarters = map { $_->attr( 'value' ) } $selects[0]->look_down( _tag => 'option' );
 9  my @courses  = map { my $s = $_->attr( 'value' ); $s =~ s/&/%26/g; $s =~ s/ /+/g; $s } $selects[1]->look_down( _tag => 'option' );
10   
11  my $n = 0;
12   
13  my %hash;
14   
15  for my $quarter ( @quarters )
16  {
17      for my $course ( @courses )
18      {
19          my $tree_b = url_to_tree( "http://www.registrar.ucla.edu/schedule/crsredir.aspx?termsel=$quarter&subareasel=$course" );
20         
21          my @options = map { my $s = $_->attr( 'value' ); $s =~ s/&/%26/g; $s =~ s/ /+/g; $s } $tree_b->look_down( _tag => 'option' );
22         
23          for my $option ( @options )
24          {
25           
26           
27              print "trying: http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=$quarter&subareasel=$course&idxcrs=$option\n";
28             
29              my $content = get( "http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=$quarter&subareasel=$course&idxcrs=$option" );
30             
31              next if $content =~ m/No classes are scheduled for this subject area this quarter/;
32             
33              $hash{"$course-$option"} = 1;
34              #my $tree_c = url_to_tree( "http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=$quarter&subareasel=$course&idxcrs=$option" );
35             
36              #my $table = ($tree_c->look_down( _tag => 'table' ))[2]->as_HTML;
37             
38              #print "$table\n\n\n\n\n\n\n\n\n\n";
39             
40              $n++;
41          }
42      }
43  }
44   
45  my $hash_count = keys %hash;
46  print "$n, $hash_count\n";
47   
48  sub url_to_tree
49  {
50      my $url = shift;
51     
52      my $content = get( $url );
53   
54      my $tree = HTML::TreeBuilder->new_from_content( $content );
55     
56      return $tree;
57  }

I'm having trouble understanding what lines 33 and 45 are doing. I think I get most of the rest: @selects holds the two select elements from the master .aspx page of the website under consideration, so I think the size of @selects is 2. The option values of the select in slot 0 of @selects go into @quarters, and similarly the option values of the select in slot 1 go into @courses. Every quarter/course/option combination that has classes scheduled is counted, so n is the total number of courses offered throughout the year. What I don't get is what $hash_count is counting. I suspect it is the number of unique courses offered, so whereas n is an animal something akin to (in pseudocode)

sizeof( ['math1 FALL 2014' , 'math1 SPRING 2014'] ) = 2

I suspect hash_count is an animal like

sizeof( ['math1 FALL 2014' , 'math1 SPRING 2014'] ) = 1

Right?

user3333975

2 Answers

3

The purpose of the hash in this instance is to make sure that duplicates are removed: a course/option combination that turns up in more than one quarter is only counted once.

The principle is simple. The hash is built up from your "course" and "option" elements. When a combination is new, a new entry is created. When it already exists, the existing value is just overwritten, as here:

$hash{"$course-$option"} = 1;

At the end, the keys function is applied to the hash that was built. In list context it would return all the keys; in this (scalar) context it just returns the number of keys, hence the count.

my $hash_count = keys %hash;

Basically the code is removing duplicates.
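To make the difference between the two counters concrete, here is a minimal sketch that replaces the web requests with made-up data. The quarter codes, course names and option names are hypothetical and are not taken from the registrar site; only the counting logic mirrors the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the scraped data: each quarter maps to the
# course/option pairs that had classes scheduled in that quarter.
my %offered = (
    '14W' => [ [ 'MATH', '1' ], [ 'MATH',    '2'  ] ],
    '14S' => [ [ 'MATH', '1' ], [ 'PHYSICS', '1A' ] ],
);

my $n = 0;    # counts every offering, once per quarter it appears in
my %hash;     # remembers each distinct course/option combination

for my $quarter ( sort keys %offered ) {
    for my $pair ( @{ $offered{$quarter} } ) {
        my ( $course, $option ) = @$pair;
        $hash{"$course-$option"} = 1;    # same key again: no new entry
        $n++;
    }
}

my $hash_count = keys %hash;    # scalar context: number of distinct keys
print "$n, $hash_count\n";      # prints "4, 3"
```

In this sketch MATH-1 is offered in two quarters, so it adds 2 to $n but only one key to %hash.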

Some reading on hashes is recommended.

But here are the basics:

Say we already have a hash defined like this:

my %hash = ( one => 1, two => 2, three => 3 );

We can assign a new value to the hash like this:

$hash{"four"} = 4;

And the new contents will be:

( one => 1, two => 2, three => 3, four => 4 )

But if we use a key that already exists, like this:

$hash{"two"} = 5;

The resulting contents will be:

( one => 1, two => 5, three => 3, four => 4 )

So we don't add an additional entry; the existing key simply has its value updated. There is only one entry for "two", and there are no duplicate keys.

We can, as in the final part of the code, get the keys of the hash like this:

my @keys = keys %hash;

And this will return a list that looks like this:

( 'one', 'two', 'three', 'four' )

The keys won't necessarily come back in that order, but let's not complicate things. If, however, we assign the result to something that does not accept a list, as here:

my $count = keys %hash;

Then what is returned is the number of items contained within the hash:

print "$count\n";

This will output 4 as the result.
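Put together, the steps above can be run as one short script. This is only a sketch of the walkthrough; the keys are sorted here purely so that the printed order is predictable.

```perl
use strict;
use warnings;

my %hash = ( one => 1, two => 2, three => 3 );

$hash{four} = 4;    # new key: the hash grows to four entries
$hash{two}  = 5;    # existing key: the value is overwritten, no new entry

my @keys  = sort keys %hash;    # list context: the keys themselves
my $count = keys %hash;         # scalar context: how many keys there are

print "@keys\n";         # prints "four one three two"
print "$count\n";        # prints "4"
print "$hash{two}\n";    # prints "5"
```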

The code collects the combined "course" and "option" values and makes sure they are unique by storing each one as a key in the hash. Finally, it assigns the number of keys to your variable $hash_count and prints the result.

Neil Lunn
  • but there shouldn't be any duplicates. $n and $hash_count should end up the same. unless, I guess, the select lists had duplicate options? but if they do, it would be much better to deduplicate earlier and avoid getting duplicate webpages. – ysth Mar 05 '14 at 06:33
  • I see. When I run it I get ten thousand some odd entries for `n` and seven thousand some odd entries for `hash_count`. I want to report the value to my university's website administrator at the registrar to compare values. Is this a robust way to enumerate unique courses? I mean, _vis-a-vis_ this hash business. – user3333975 Mar 05 '14 at 06:35
  • @user3333975 It's perfectly sound. As a hash consists of key/value pairs, the process here is using the inherent uniqueness of the key in order to *weed out* duplicates. You also have an additional test on fetching the web page, which skips to the next iteration before a value is added when no classes are scheduled. – Neil Lunn Mar 05 '14 at 06:41
  • @ysth an optimization may be to test if the hash key **exists** before doing any other processing. But this is for someone to understand what a hash is. It's removing the duplicates from array entries. Returning the unique count. – Neil Lunn Mar 05 '14 at 06:44
  • oh, my mistake; I didn't see there were three loops. There are indeed possibly duplicate course/option combinations in different quarters. – ysth Mar 05 '14 at 07:11
  • @NeilLunn, what does "duplicates" mean in this context? I mean, and I've asked this to my friend, what exactly--in clear English--is `hash_count`? – user3333975 Mar 05 '14 at 09:02
  • @user3333975 The full explanation of Hash, Map, Dictionary is kind of lengthy and out of scope for this, hence the link for you to look at, hoping you have some programming understanding, if not of Perl. But if I get your meaning, `$hash_count` is just a variable to which a **value** has been assigned, based on the **number** of **keys** found in the **hash**. So it's the actual "hash" assignment that is doing the work, and the **keys** part gives the value. There was more detail in the answer, but I hope that fills in the gaps. – Neil Lunn Mar 05 '14 at 12:31
  • @NeilLunn, That's not what I meant at all. I meant, if you had to explain to the public on CNN what $hash_count was doing in this program, what would you say? – user3333975 Mar 05 '14 at 17:46
2
  • Line 33 stores $course-$option as a key in the hash, with 1 as its associated value. Why? Hashes provide a convenient and quick mechanism for lookups. Those values could instead have been stored in an array, but subsequent lookups (to test whether a given key has been seen before) would not be nearly as quick.
  • Line 45 is a syntactically dense statement, but it is essentially storing the number of keys in the hash. In list context, the keys function returns a list containing, you guessed it, all of the keys in the hash. However, since the variable to which the result is being assigned ($hash_count) is a scalar, keys is called in scalar context, where it returns the number of keys instead. An array evaluated in scalar context behaves the same way: it yields the number of entries in that array.
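
As a rough illustration of both points, the sketch below uses a small hand-made hash (the keys are hypothetical, not real registrar data) to show the count coming out of scalar context and the quick "seen before?" lookup that a hash allows.

```perl
use strict;
use warnings;

# Hypothetical keys in the same "$course-$option" shape as line 33.
my %seen = ( 'MATH-1' => 1, 'MATH-2' => 1, 'PHYSICS-1A' => 1 );

my @all_keys  = keys %seen;              # list context: the three keys
my $from_keys = keys %seen;              # scalar context: the count
my $from_arr  = @all_keys;               # array in scalar context: its size
my $explicit  = scalar( keys %seen );    # scalar() forces scalar context

print "$from_keys $from_arr $explicit\n";    # prints "3 3 3"

# exists checks for a key without creating it.
my $known   = exists $seen{'MATH-1'}  ? 'seen' : 'new';
my $unknown = exists $seen{'CHEM-14'} ? 'seen' : 'new';
print "$known $unknown\n";                   # prints "seen new"
```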
Daniel Standage