
I have a data stream in JSON format that my script accesses from an internal website. The script converts the JSON to a Perl hash using JSON.pm (I'm using Perl 5.10.1 on RHEL 6.9).

Within this hash are multiple nested hashes, and nested arrays, some of which are nested within other hashes/arrays inside of the big hash.

I need to walk the entire structure of the hash, including all of the arrays and nested hashes, and remove every key, anywhere in the structure, that shares its name with another key (but only for one specific key name).

Additionally, because of how the data is structured, some nested hashes contain ONLY keys that get deleted, which leaves an empty hash as the value of their parent key. I also need to remove those keys whose value is an empty hash.

Here is my data after its conversion to Perl:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'husky' => {
                                                'name' => 'fred'
                                             },
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'husky' => 'wilma',
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles' # yes this is intentionally a scalar in the example
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                     {
                        'e' => {
                                  'husky' => 'dino'
                               },
                     },
                   ],
        }

We want to remove all keys named 'husky'.

Here is what it should look like:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles'
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                   ],
        }

Here is what I get after adding @Shawn's code and making a tweak to it (this is very close, but we still need to account for the empty hashes):

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles' # yes this is intentionally a scalar in the example
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                     {
                        'e' => {},
                     },
                   ]
        }

I've tried a few variations found elsewhere on SO and on PerlMonks, such as `keys %$_ == 0` and `!%$_`, but none seem to work with this hash slice.
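For reference, the empty-hash test itself does work when I reduce it to a standalone sketch with made-up data; it's combining it with the duplicate-key slice above that I can't get right:

```perl
use strict;
use warnings;

# Made-up data standing in for one of the nested hashes
my %obj = (
    e => {},                        # empty hash, should be deleted
    d => { shihtzu => 'bambam' },   # non-empty, should stay
    c => 'pebbles',                 # plain scalar, should stay
);

# Delete every key whose value is a hashref with nothing left in it;
# 'and' short-circuits, so plain scalars are never dereferenced
delete @obj{ grep { ref $obj{$_} eq 'HASH' and not %{ $obj{$_} } } keys %obj };

# %obj now holds only 'd' and 'c'
```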

Code:

use 5.010;    # the code below uses //=, which needs at least Perl 5.10
use strict;
use warnings;
use English; # I know I know, don't use English...
use JSON;
use YAML::Tiny qw(Dump);
# proprietary modules I wrote added here, which themselves load in LWP, HTTP::Cookie and others, and they do the bulk of building and sending the request. They are the back end to this script's front end.

[-snipped a ton of code-]

sub _count_keys
{
    my ($j, $seen) = @ARG;
    my $type = ref $j;
    if ($type eq "ARRAY")
    {
        for (@{$j})
        {
            _count_keys($ARG, $seen);
        }
    }
    elsif ($type eq "HASH")
    {
        while (my ($key, $val) = each %{$j})
        {
            $seen->{$key}++;
            if (ref $val)
            {
                _count_keys($val, $seen);
            }
        }
    }
    return $seen;
}

sub _remove_duplicate_keys
{
    my ($j, $seen) = @ARG;
    $seen //= _count_keys($j, {});

    my $type = ref $j;
    if ($type eq "ARRAY")
    {
        return [ map { _remove_duplicate_keys($ARG, $seen) } @{$j} ];
    }
    elsif ($type eq "HASH")
    {
        my %obj = %{$j};
        delete @obj{grep { $seen->{$ARG} > 1 and $ARG eq 'keyNameToBeExcluded'} keys %obj};
# Here is where I have been putting another delete line but I can't seem to find the right parameters for the grep to make it delete the empty anon hashes. Example of what I tried is the next comment below
#        delete @obj{grep { $seen->{$ARG} > 1 and keys $ARG{assetDetails} == 0 } keys %obj};

        while (my ($key, $val) = each %obj)
        {
            if (ref $val)
            {
                $obj{$key} = _remove_duplicate_keys($val, $seen);
            }
        }
        return \%obj;
    }
    else
    {
        return $j;
    }
}

sub _process_json
{
    my $JSONOUTPUT   = shift;
    my $OPTIONS      = shift;

    # Change true to 1 and false to 0 to prevent blessed objects from appearing in the JSON, which prevents the YAML::Tiny module from barfing
    foreach (@{$JSONOUTPUT})
    {
        s{true(,\n)}{1$1}gxms;
        s{false(,\n)}{0$1}gxms;
    }

    my $JSONPERLOBJ  = JSON->new->utf8->decode(@{$JSONOUTPUT});

# Test code below here; real code not in use while I test getting the output right.
use Data::Dumper;
my $BEFORE = $JSONPERLOBJ;
my $AFTER = _remove_duplicate_keys($JSONPERLOBJ);
#    $JSONPERLOBJ = _remove_duplicate_keys($JSONPERLOBJ);
#print Dumper $BEFORE;
print Dumper $AFTER;
exit 1;
# End test code
}
sub _main
{
    [-snip private code-]
    my @JSONOUTPUT = $RESPONSE->decoded_content;
    my $RC = _process_json(\@JSONOUTPUT, $OPTIONS);

    exit( ($RC == 1) ? 0 : 1 );  # parenthesized: a bare "exit ($RC == 1) ? 0 : 1" would exit before the ternary
}
Speeddymon
  • `[ ... ]` creates an *array*. Please `use Data::Dumper` and `print Dumper \%obj` so that we can see what you actually have. Also, in Perl an "object" is a data item that has been blessed into a class. What you have here is just a data structure. – Borodin Sep 19 '18 at 23:30
  • Is this (a representation of) what JSON returns? That would be a hashref, `my $hr = {...}`. You are showing an arrayref `[...]`, assigned to a hash (`%`) variable. – zdim Sep 19 '18 at 23:47
  • 1
    Edited the OP. No reason to downvote over a simple typo. I cannot give you a dump because it contains thousands of lines of information, most of which is sensitive. Take the example as I gave it as that is all I can provide. @zdim yes its a representation of the JSON data. I manually typed the example and used the wrong notation. My bad. Can we focus on the actual question now? – Speeddymon Sep 20 '18 at 01:02
  • @Speeddymon which key value should persist, or are all the values, with dup keys, the same? – Rafael Sep 20 '18 at 01:03
  • @Rafael All of the keys I'm trying to delete have different values, however I do want every instance of the given key to be blown away, there is no reason to keep this particular key, and no way to filter it out before it's given to my script as JSON data. – Speeddymon Sep 20 '18 at 01:05
  • 1
    Thank you for your update, but the data structure that you show is impossible, and if you tried to compile it you would see `Odd number of elements in anonymous hash` errors. Please dump your *real data*, otherwise we cannot help you, – Borodin Sep 20 '18 at 01:18
  • 1
    The OP can certainly use improvement, but it's not difficult to understand what he's trying to do, for he has JSON and he wants to remove entries with duplicate keys found *anywhere* in the JSON data. – Rafael Sep 20 '18 at 01:22
  • *"No reason to downvote over a simple typo"* It is much more likely that the downvote was because you haven't made any effort at all to solve this problem yourself. *"Can we focus on the actual question now?"* Don't be so rude to people who are offering you programming solutions for free, or you will quickly find that no one wants to help you. – Borodin Sep 20 '18 at 01:22
  • @Rafael: *"it's not difficult to understand what he's trying to do"* I think it's pretty much impossible. If you want to take a chance and guess what the data structure really looks like then do so, but don't criticise those of us who don't like writing answers based on wild guesses. – Borodin Sep 20 '18 at 01:25
  • @Rafael: You know that the data you've edited into the question is invalid Perl, right? – Borodin Sep 20 '18 at 02:04
  • @Speeddymon: *"In my example below, I'm going to only list out hashes, not arrays, but please note that arrayrefs are in there too with some hashes inside of them"* So you can't be bothered to write some representative data. Don't you think that's a bit cheeky when you're asking for help for free? You've shown illegal Perl syntax that doesn't even try to represent your real data properly. That's terrible. – Borodin Sep 20 '18 at 02:05
  • @Speeddymon -- I asked for clarification merely so that I know what data I am looking at. If I don't know whether it's a hash (the variable) or an arrayref (the structure) or a hashref (expected) I can't take the question. Thank you for clarification. Btw, I didn't downvote. – zdim Sep 20 '18 at 03:18
  • Understood @zdim, apologies for my misunderstanding your intent. – Speeddymon Sep 20 '18 at 15:38
  • @Borodin sensitive data is sensitive data. I cannot and will not lose my job. You want representative data, which is fine. I'll attempt to make a better representation in the question. **haven't made any effort at all to solve this problem yourself** really? I said I have a big complicated nested loop doing the work now. I'm trying to simplify it and I even said I was fairly sure the `map` function was what I needed, but I hadn't grasped how to use it properly yet. – Speeddymon Sep 20 '18 at 15:44
  • @Speeddymon: *"I cannot and will not lose my job. You want representative data, which is fine"* Uh, so which is it? It sounds like your real data is too big to be useful anyway. I think I've been clear from the start that all we need is something that we can work with, but you have yet to post anything that is even valid Perl. Surely you're not beyond testing what you post? We have no code or useful data and still you rail at us for asking for something useful. – Borodin Sep 20 '18 at 17:30
  • @Speeddymon: If your most recent edit is your latest stab at writing representative data then thank you for showing something that compiles, but it *still* contains no arrays, while you say that the real data does, and I don't understand why elements with key `b` shouldn't be deleted as well as those with `a`. And where is the desired result that you want? At this point I'm bailing out. You refuse to put in enough effort to explain your problem properly, and just get angry with people who point out that your post is inadequate and unanswerable. There's nothing more I can do for you. – Borodin Sep 20 '18 at 17:47

2 Answers


I think this does what you want:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use JSON::XS; # Better than JSON; also see JSON::MaybeXS

my $j = <<EOJSON;
{
  "foo": 1,
  "bar": {
      "foo": true,
      "baz": false
      },
  "dog": "woof",
  "cat": [ { "foo": 3 } ]
}
EOJSON

sub count_keys {
  my ($j, $seen) = @_;
  my $type = ref $j;
  if ($type eq "ARRAY") {
    count_keys($_, $seen) for @$j;
  } elsif ($type eq "HASH") {
    while (my ($key, $val) = each %$j) {
      $seen->{$key}++;
      count_keys($val, $seen) if ref $val;
    }
  }
  return $seen;
}

sub remove_dups {
  my ($j, $seen) = @_;
  $seen //= count_keys($j, {});

  my $type = ref $j;
  if ($type eq "ARRAY") {
    return [ map { remove_dups($_, $seen) } @$j ];
  } elsif ($type eq "HASH") {
    my %obj = %$j;
    delete @obj{grep { $seen->{$_} > 1 } keys %obj};
    while (my ($key, $val) = each %obj) {
      $obj{$key} = remove_dups($val, $seen) if ref $val;
    }
    return \%obj;
  } else {
    return $j;
  }
}

my $parsed = decode_json $j;
my $printer = JSON::XS->new->pretty->canonical;
say "Before:";
print $printer->encode($parsed);
say "After:";
my $dedup = remove_dups $parsed;
print $printer->encode($dedup);

produces

Before:
{
   "bar" : {
      "baz" : false,
      "foo" : true
   },
   "cat" : [
      {
         "foo" : 3
      }
   ],
   "dog" : "woof",
   "foo" : 1
}
After:
{
   "bar" : {
      "baz" : false
   },
   "cat" : [
      {}
   ],
   "dog" : "woof"
}

Edit for explanation:

The first time remove_dups is called on a Perl data structure representing a JSON value (which doesn't have to be a JSON object), it calls count_keys to recursively walk the structure and create a hash of all the keys and the number of times each one occurs. Then it again recursively walks the structure, returning a deep copy without keys that appeared more than once in the original.

This line is the real magic:

delete @obj{grep { $seen->{$_} > 1 } keys %obj};

It uses a hash slice to delete a bunch of keys all at once, with the grep bit returning a list of keys that appeared more than once. See perldoc perldata for more information on slices.
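Reduced to a standalone sketch with made-up keys and a hand-built `%seen` (in the real code `%seen` comes from `count_keys`), the idiom looks like this:

```perl
use strict;
use warnings;

my %h    = ( keep => 1, drop1 => 2, drop2 => 3 );
my %seen = ( keep => 1, drop1 => 2, drop2 => 5 );   # pretend occurrence counts

# @h{LIST} is a hash slice: it names several elements of %h at once,
# and delete removes every key that the grep selected
delete @h{ grep { $seen{$_} > 1 } keys %h };

# %h is now ( keep => 1 )
```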

Shawn
  • I'd do something similar, iterate the JSON twice: 1) build a hash with key/occurrences 2) create a hash, or delete from the source, keep entries that only appear once. – Rafael Sep 20 '18 at 01:31
  • Implementation has a depth or breadth descent, takes a function ref arg to call on hash entry, pass func ref that constructs key/occur hash, iterate again, pass entry removal func --- presto, O(n). – Rafael Sep 20 '18 at 01:37
  • Thank you @Shawn, I'll give this a try and update if I run into trouble. – Speeddymon Sep 20 '18 at 15:38
  • It's best to use `Scalar::Util::reftype` in preference to `ref`, as the latter will return the package name if the reference is to blessed data. – Borodin Sep 20 '18 at 17:50
  • I've gone and updated the OP taking your "After" hash as a great example of how my data is formatted (with minor tweaking) -- Can you have a look at the output I'm getting after adding your code? I need to account for the anonymous hashes in the 'dog' array having the same (undefined value for a) name. – Speeddymon Sep 23 '18 at 22:03
  • @Speeddymon When I tested with your new data and my code, I got the desired output – Shawn Sep 23 '18 at 22:49
  • Hi @Shawn what perl version are you running? I'm using 5.10.1 (the bundled version for RHEL 6.9) -- maybe a quirk of us using different versions? I ran into a similar issue when the data comes through as a more shallow hash of hashes, where $obj->{foo} is a list of anonymous hashes. Since all of the anonymous hash "names" are the same (I assume `undef`) then it deletes the contents of those hashes. I'll update the OP momentarily to help explain – Speeddymon Sep 24 '18 at 02:44
  • @Speeddymon 5.10? Wow. That's ancient. I don't even know if it supports things like hash slices or defined-or. – Shawn Sep 24 '18 at 03:12
  • It just about does. The `//=` was introduced in 5.10(.1?). I think that hash slices were in a lot longer. – zdim Sep 24 '18 at 21:03
  • @zdim yep, it has both. I had to look up the defined-or operator as I hadn't seen it before, but I've seen (and used) hash slices in your code before! Thanks btw! I owe you a few rounds of your favorite drink/soda/coffee/etc ! – Speeddymon Sep 25 '18 at 04:07
  • Oh before I forget, it doesn't have JSON::XS nor JSON::MaybeXS so I suspect if not a difference between perl versions, then between JSON implementations. I would just grab it from CPAN but it's blocked by our proxy, and not in our (highly) filtered private mirrors of public repos. TBH I'm surprised SO isn't blocked. – Speeddymon Sep 25 '18 at 04:09
  • @Speeddymon You should post your current version of the code. Plus it looks like the actual problem is different from just 'remove all duplicate keys'? (I think the only JSON parsing module that comes with Perl is JSON::PP, and that wasn't added until long after 5.10) – Shawn Sep 25 '18 at 04:57
  • @Speeddymon I don't know what you'd owe me for, but it'd sure be good to get some drinks/etc together :). I dropped this once Shawn answered but I'll look into it if the problem persists (time's a "little" tight right now) – zdim Sep 25 '18 at 19:37
  • @Shawn added. Sorry I still had to strip a lot but hopefully the example is usable enough. – Speeddymon Sep 25 '18 at 22:07
  • @zdim should be ok, but yes lets do if ever we are in the same city together. :-) – Speeddymon Sep 25 '18 at 22:07
  • @Shawn went ahead and restructured the whole post to be more concise and clear about the current issue. It is still related to removing duplicate keys, but turns into a new challenge when the keys are the pseudo-name of an anonymous hash. – Speeddymon Sep 25 '18 at 22:33
  • 1
    @Speeddymon I take this (Shawn's) answer to have done the job (had +1-ed it), and it's nicely accepted -- I posted only for the follow-up issue of empty structures. – zdim Sep 26 '18 at 09:21

I take it that Shawn's answer works for removing duplicates, which is what it is meant for.

The follow-up problem is that we may end up with empty structures, and those need to be removed as well. But then there may also be structures that contain only empty structures, and so on; I assume that all such structures need to go.

I use the desired-result-hashref from the question (from which I remove one name=>... so that there are no duplicates) and add some empty trouble.

use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd pp);

my $hr = {
    'cat' => 'meow',
    'dog' => [
        { 'a' => { 'chow' =>  { 'name' => 'barney' } }  },
        { 'b' => { 'lab' => 'betty' }, 'c' => 'pebbles' },
        { 'd' => { 'shihtzu' => 'bambam' }              },
        {   # all of the following need to go, and this hashref
            'e' => { },  
            'f' => { noval => { } },
            'g' => [ { }, { nada => { } }, [ ] ],
        },
    ],  
};
dd $hr; say '';

for my $k (sort keys %$hr) {
    next_level($hr, $k, $hr->{$k}, 'key');
}

# Takes: data structure (reference), key/index at which it is found, 
# its value for it, and description string of which it is, 'key|idx'
sub next_level {
    my ($ds, $index, $val, $kind) = @_;
    my $type = ref $val;

    if ($type eq 'ARRAY') {
        for my $i (0..$#$val) {
            next_level(
                ( $kind eq 'key' ? $ds->{$index} : $ds->[$index] ),
                $i, $val->[$i], 'idx' 
            );
        }

        # Collect indices for and delete elements that are empty
        my @to_delete;
        for my $i (0..$#$val) {
            if ( (ref $val->[$i] eq 'HASH'  and not keys %{$val->[$i]}) or
                 (ref $val->[$i] eq 'ARRAY' and not      @{$val->[$i]})  )
            {
                say "No value/empty for index $i, record for deletion";
                push @to_delete, $i;
            }
        }
        if (@to_delete) { 
            my %ref_idx = map { $_ => 1 } @to_delete;
            @$val = @$val[ grep { not exists $ref_idx{$_} } 0..$#$val ];
        }
    }
    elsif ($type eq 'HASH') {
        for my $k (sort keys %{$val}) {
            my $ds_next_level = 
                ($kind eq 'key') ? $ds->{$index} : $ds->[$index];

            next_level( $ds_next_level, $k, $val->{$k}, 'key' );

            # Delete if empty 
            if ( (ref $val->{$k} eq 'HASH'  and not keys %{$val->{$k}}) or
                 (ref $val->{$k} eq 'ARRAY' and not      @{$val->{$k}})  )
            {
                say "No value/empty for key $k, delete";
                delete $ds_next_level->{$k}
            }
        }
    }
    #elsif (not $type) { say "A value: ", $val }
}
say ''; dd $hr;

This is a normal recursive traversal of a complex data structure, but with a twist: in order to be able to delete components, the recursive sub also needs the containing data structure, the key (in a hashref) or index (in an arrayref) at which the current value is found, and which of the two it is.

After the recursion, the target is deleted if it is empty and it lives in a hashref. An arrayref is first scanned for empty elements, which are then removed by overwriting the arrayref with an array slice that excludes the indices of elements containing only empty data structures.
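Reduced to a sketch with a plain array and made-up indices, the slice-and-exclude step is:

```perl
use strict;
use warnings;

my @vals      = ( 'a', {}, 'b', [], 'c' );
my @to_delete = ( 1, 3 );    # indices found to hold empty structures

# Build a lookup of indices to drop, then overwrite the array with an
# array slice over the indices that are not in the lookup
my %ref_idx = map { $_ => 1 } @to_delete;
@vals = @vals[ grep { not exists $ref_idx{$_} } 0 .. $#vals ];

# @vals is now ( 'a', 'b', 'c' )
```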

For the exclusion of "bad" indices a reference hash is used for efficiency. Overwriting the array may be faster using map (see this post), or it may not be if slicing allows specific (interpreter) optimizations.
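For comparison, the index bookkeeping can be skipped entirely by filtering the elements themselves with grep (a sketch with made-up data; whether it is actually faster would need measuring):

```perl
use strict;
use warnings;

my @vals = ( 'a', {}, 'b', [], { x => 1 } );

# Keep only elements that are not empty hashrefs or arrayrefs;
# 'and' short-circuits, so plain scalars are never dereferenced
@vals = grep {
    not(   ( ref $_ eq 'HASH'  and not %{$_} )
        or ( ref $_ eq 'ARRAY' and not @{$_} ) )
} @vals;

# @vals is now ( 'a', 'b', { x => 1 } )
```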

The output

{
  cat => "meow",
  dog => [
           { a => { chow => { name => "barney" } } },
           { b => { lab => "betty" }, c => "pebbles" },
           { d => { shihtzu => "bambam" } },
           { e => {}, f => { noval => {} }, g => [{}, { nada => {} }, []] },
         ],
}

No value/empty for key e, delete
No value/empty for key noval, delete
No value/empty for key f, delete
No value/empty for key nada, delete
No value/empty for index 0, record for deletion
No value/empty for index 1, record for deletion
No value/empty for index 2, record for deletion
No value/empty for key g, delete
No value/empty for index 3, record for deletion

{
  cat => "meow",
  dog => [
           { a => { chow => { name => "barney" } } },
           { b => { lab => "betty" }, c => "pebbles" },
           { d => { shihtzu => "bambam" } },
         ],
}
zdim
  • @Speeddymon Updated, as it dawned on me that there is no good reason to first check and remove empty things before recursing. The recursion on an empty data structure is quick, so all empty ones can be removed after it. This lightened the code. – zdim Sep 26 '18 at 17:40
  • Thank you @zdim. One question... Shawn's code traverses the hash once, and this appears to be an additional function to add alongside his code. Am I interpreting correctly that the hash would be traversed once by his code and then a second time by yours? Or am I just that dense and missing something totally obvious? – Speeddymon Sep 26 '18 at 20:53
  • @Speeddymon Yes, that is correct: This would go over the hashref as it remains after Shawn's code did its thing (which traverses it _twice_ -- once to scoop up dupes and then to remove them). This could be made to find and delete dupes as well, but it would considerably expand it, while you already have that part working. Or you could merge this into Shawn's second pass? I think that having two subs isn't too bad to start with for this job -- just how big is the data? If there _is_ a problem with time then that's a different matter. Having separate subs for this is reasonable in principle, too. – zdim Sep 27 '18 at 00:11
  • Okay yes you're right, it would be a third traversal. The data set's size varies. The largest output I got in limited testing was less than 1mb, but it could theoretically send out much much more, as the webapp's output depends on the options passed to the script, and the webapp interfaces with an oracle db. I don't mind traversing the hash a thousand times as long as it doesn't add a *noticeable* delay. Of course, how noticeable the delay is, depends on the output of the webapp. – Speeddymon Sep 28 '18 at 01:12
  • Something I thought of that I can look into that will help speed in the case of a larger dataset is to implement threading, but I have no idea if perl 5.10 even has a notion of threads. – Speeddymon Sep 28 '18 at 01:19
  • @Speeddymon Okay. I'll look into efficiency (as time permits though), as it may matter by what you're saying. You can absolutely implement this in parallel -- if it comes to that. I don't know whether threads would be th better choice though, compared with forking. This would depend on the whole setup. – zdim Sep 28 '18 at 02:39