2

I'm a novice perl programmer trying to identify which elements are in one array of hashes but not in another. I'm trying to search through the "new" array, identifying the id, title, and created elements that don't exist from the "old" array.

I believe I have it working with a set of basic for() loops, but I'd like to do it more efficiently. This only came after having tried to use grep() and failed.

These arrays are built from a database as such:

use DBI;
use strict;
use Data::Dumper;
use Array::Utils qw(:all);
sub db_connect_new();
sub db_disconnect_new($);
sub db_connect_old();
sub db_disconnect_old($);

my $dbh_old   = db_connect_old();
my $dbh_new   = db_connect_new();

# get complete list of articles on each host first (Joomla! system)
my $sql_old   = "select id,title,created from mos_content;"; 
my $sql_new   = "select id,title,created from xugc_content;";

my $sth_old   = $dbh_old->prepare($sql_old);
my $sth_new   = $dbh_new->prepare($sql_new);

$sth_old->execute();
$sth_new->execute();

my $ref_old;
my $ref_new;

while ($ref_old = $sth_old->fetchrow_hashref()) {
  push @rv_old, $ref_old;
}

while ($ref_new = $sth_new->fetchrow_hashref()) {
  push @rv_new, $ref_new;
}

my @seen = ();
my @notseen = ();
foreach my $i (@rv_old) {
   my $id = $i->{id};
   my $title = $i->{title};
   my $created = $i->{created};
   my $seen = 0;
   foreach my $j (@rv_new) {
      if ($i->{id} == $j->{id}) {
         push @seen, $i;
         $seen = 1;
      }
   }
   if ($seen == 0) {
       print "$i->{id},$i->{title},$i->{state},$i->{catid},$i->{created}\n";
      push @notseen, $i;
   }
}

The arrays look like this when using Dumper(@rv_old) to print them:

$VAR1 = {
          'title' => 'Legal Notice',
          'created' => '2004-10-07 00:17:45',
          'id' => 14
        };
$VAR2 = {
          'created' => '2004-11-15 16:04:06',
          'id' => 86096,
          'title' => 'IRC'
        };
$VAR3 = {
          'id' => 16,
          'created' => '2004-10-07 16:15:29',
          'title' => 'About'
        };

I tried to use grep() using array references, but I don't think I understand arrays, hashes, and references well enough to do it properly. My failed grep() attempts are below. I'd appreciate any ideas of how to do this properly.

I believe the problem with this is that I don't know how to reference the id field in the second array of hashes. Most of the examples using grep() that I've seen are to just look through an entire array, like you would with regular grep(1). I need to iterate through one array, checking each of the values from the id field with the id field from another array.

  my $rv_old_ref        = \@rv_old;
  my $rv_new_ref        = \@rv_new;

  for my $i ( 0 .. $#rv_old) {
    my $match = grep { $rv_new_ref->$_ == $rv_old_ref->$_ } @rv_new;
    push @notseen, $match if !$match;
  }

I also tried variations on the grep() above:

1) if (($p) = grep ($hash_ref->{id}, @rv_old)) {
2) if ($hash_ref->{id} ~~ @rv_old) {
Alex Regan
  • 477
  • 5
  • 16

3 Answers3

3

There are a number of libraries that compare arrays. However, your comparison involves complex data structures (the arrays have hashrefs as elements) and this at least complicates use of all modules that I am aware of.

So here is a way to do it by hand. I use the shown array and its copy with one value changed.

use warnings;
use strict;
use feature 'say';

use List::Util qw(none);   # in List::MoreUtils with older Perls
use Data::Dump qw(dd pp);

sub hr_eq {
    my ($e1, $e2) = @_; 
    return 0 if scalar keys %$e1 != scalar keys %$e2;
    foreach my $k1 (keys %$e1) {
       return 0 if !exists($e2->{$k1}) or $e1->{$k1} ne $e2->{$k1};            
    }   
    return 1
}

my @a1 = ( 
    { 'title' => 'Legal Notice', 'created' => '2004-10-07 00:17:45', 'id' => 14 },
    { 'created' => '2004-11-15 16:04:06', 'id' => 86096, 'title' => 'IRC' },  
    { 'id' => 16, 'created' => '2004-10-07 16:15:29', 'title' => 'About' }
);        
my @a2 = ( 
    { 'title' => 'Legal Notice', 'created' => '2004-10-07 00:17:45', 'id' => 14 },
    { 'created' => '2004-11-15 16:xxx:06', 'id' => 86096, 'title' => 'IRC' },  
    { 'id' => 16, 'created' => '2004-10-07 16:15:29', 'title' => 'About' }
);

my @only_in_two = grep { 
    my $e2 = $_; 
    none { hr_eq($e2, $_) } @a1;
} @a2;

dd \@only_in_two;

This correctly identifies the element in @a2 that doesn't exist in @a1 (with xxx in timestamp).

Notes

  • This finds what elements of one array are not in another, not the full difference between arrays. It is what the question specifically asks for.

  • The comparison relies on details of your data structure (hashref); there's no escaping that, unless you want to reach for more comprehensive libraries (like Test::More).

  • This uses string comparison, ne, even for numbers and timestamps. See whether it makes sense for your real data to use more appropriate comparisons for particular elements.

  • Searching through a whole list for each element of a list is an O(N*M) algorithm. Solutions of such (quadratic) complexity are usable as long as data isn't too big; however, once data gets big enough so that size increases have clear effects they break down rapidly (slow down to the point of being useless). Time it to get a feel for this in your case.

    An O(N+M) approach exists here, utilizing hashes, shown in ikegami answer. This is much better algorithmically, once the data is large enough for it to show. However, as your array carries complex data structure (hashrefs) a bit of work is needed to come up with a working program, specially as we don't know data. But if your data is sizable then you surely want to implement this.


Some comments on filtering.

The question correctly observes that for each element of an array, as it's processed in grep, the whole other array need be checked.

This is done in the body of grep using none from List::Util. It returns true if the code in its block evaluates false for all elements of the list; thus, if "none" of the elements satisfy that code. This is the heart of the requirement: an element must not be found in the other array.

Care is needed with the default $_ variable, since it is used by both grep and none.

In grep's block $_ aliases the currently processed element of the list, as grep goes through them one by one; we save it into a named variable ($e2). Then none comes along and in its block "takes possession" of $_, assigning elements of @a1 to it as it processes them. The current element of @a2 is also available since we have copied it into $e2.

The test performed in none is pulled into a a subroutine, which I call hr_eq to emphasize that it is specifically for equality comparison of (elements in) hashrefs.

It is in this sub where the details can be tweaked. Firstly, instead of bluntly using ne for values for each key, you can add custom comparisons for particular keys (numbers must use ==, etc). Then, if your data structures change this is where you'd adjust specifics.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • Yeah, there's no way I could understand what you've done :-( I will try and figure it out. – Alex Regan Mar 08 '19 at 22:38
  • @AlexRegan Oh, sorry ... Your own code isn't far off, so I thought this would be clear with some thought. I can add explanations -- which part is opaque? – zdim Mar 08 '19 at 22:40
  • This is a slow approach (O(N*M) [basically O(N^2)]) compared to the faster one available (O(N+M) [basically, O(N)]). – ikegami Mar 09 '19 at 06:27
  • @ikegami Yeah. I considered (briefly, was going to return to this), and went for something related to their `grep` efforts. Also, using hashes (for a working example) didn't seem so clean as the whole hashref need be accounted for somehow. We don't even know whether any keys may be considered unique (except for duplicate hashrefs). I've added a bullet on this. Thank you. – zdim Mar 09 '19 at 07:55
  • Re "*We don't even know whether any keys may be considered unique*", The OP actually mentions they only want to report missing `id` values, but the hash approach can still be used with multi-field keys by using something like `sub key { pack '(J/a*)*', @{ $_[0] }{ @key_fields } }` to generate the hash key. – ikegami Mar 09 '19 at 08:16
  • @ikegami "_only want to report missing `id` values_" -- yes, you're right; I somehow ignored that (I guess expecting a more complex real problem?). But yes, that packin' is a kind of a thing I had in mind :) Thanks, going to study that some :)) – zdim Mar 09 '19 at 08:26
  • `J/a*` creates a length-prefixed string (`J` = unsigned integer of the size used by Perl). `(J/a*)*` creates one for each other argument. These length-prefixed strings are concatenated. So, `pack('(J/a*)*', 'abc', 'def')` could create `03 00 00 00 00 00 00 00 61 62 63 03 00 00 00 00 00 00 00 64 65 66` – ikegami Mar 09 '19 at 08:33
  • @AlexRegan "_no way I could understand what you've done_" --- I had added detailed explanations (in case you haven't noticed). If they aren't good enough you can let me know – zdim Jun 19 '19 at 05:23
2

You could use grep.

for my $new_row (@new_rows) {
   say "$new_row->{id} not in old"
      if !grep { $_->{id} == $new_row->{id} } @old_rows;
}

for my $old_row (@old_rows) {
   say "$old_row->{id} not in new"
      if !grep { $_->{id} == $old_row->{id} } @new_rows;
}

But that's an O(N*M) solution, while there exists an O(N+M) solution that would be far faster.

my %old_keys;  ++$old_keys{ $_->{id} } for @old_rows;
my %new_keys;  ++$new_keys{ $_->{id} } for @new_rows;

for my $new_row (@new_rows) {
   say "$new_row->{id} not in old"
      if !$old_keys{$new_row->{id}};
}

for my $old_row (@old_rows) {
   say "$old_row->{id} not in new"
      if !$new_keys{$old_row->{id}};
}

If both of your database connections are to the same database, this can be done far more efficiently within the database itself.

  1. Create a temporary table with three fields, id, old_count (DEFAULT 0) and new_count (DEFAULT 0).
  2. INSERT OR UPDATE from the old table into the temporary table, incrementing old_count in the process.
  3. INSERT OR UPDATE from the new table into the temporary table, incrementing new_count in the process.
  4. SELECT the rows of the temporary table which have 0 for old_count or 0 for new_count.
ikegami
  • 367,544
  • 15
  • 269
  • 518
1
select id,title,created from mos_content
     LEFT JOIN xugc_content USING(id)
     WHERE xugc_content.id IS NULL;

Gives you the rows that are in mos_content but not in xugc_content.

That's even shorter than the Perl code.

Rick James
  • 135,179
  • 13
  • 127
  • 222