6

I'm looking for a data structure which should preferably perform equal O(1)? for any number of elements when adding/removing/retrieving elements.

Here are some additional guidelines,

  • retrieving elements should not involve slow keys()
  • elements must be always unique and defined
  • element order is not significant
  • addition or removal of element should not involve iteration over other elements
  • gaps in retrieved list of elements are tolerable and can be represented with undef value

Please suggest better solution than,

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

UPDATE (usage)

my $fa = uniqArrayFactory();

$fa->(add => 10);
$fa->(del => 10);
my $members = $fa->(mebers => 1);
mpapec
  • 50,217
  • 8
  • 67
  • 127
  • Can you add some examples of how you would call that please? – simbabque May 19 '16 at 12:10
  • @simbabque please check update. – mpapec May 19 '16 at 12:24
  • 5
    ... am I being dense, or are you describing a hash? – Sobrique May 19 '16 at 12:43
  • @Sobrique yes I am, but currently not happy with `keys` performance, and it would get invoked a lot. – mpapec May 19 '16 at 12:47
  • 7
    I'm sure you won't be able to write something in Perl that is faster than the built-in `keys` operator, which is written in C. I think you need to explain more about what you're doing that is giving you unacceptable performance. `keys` is a trivial operation and does little more than return a pre-existing list of C strings reformatted as Perl scalars – Borodin May 19 '16 at 13:05
  • 3
    @Borodin returning the ready array ref `$members` is faster as `return [keys %$seen]`. So, if the factory must be very fast for the `$factory->(members => 1)` then the `keys` isn't the fastest solution. – clt60 May 19 '16 at 14:33
  • 1
    keys() is very far from O(1), not to mention what jm666 wrote. I see this as a worthy memory/speed tradeoff in case when elements are often being retrieved. – mpapec May 19 '16 at 15:51
  • At some point, you will iterate the member array which is O(n) anyway. So I don't understand why you insist on returning the member array in O(1). – nwellnhof May 19 '16 at 19:06
  • 1
    @nwellnhof that is a valid point so I did some benchmarks which include member iteration, https://gist.github.com/mpapec/29198302b7cca8d3c74be45ff4b12230 (`keys()` and a few flavors of `each()`) and differences are significant. – mpapec May 19 '16 at 20:28
  • 1
    @Сухой27 Interesting, I didn't realize that `keys` and `each` are so slow. `for my $v (keys %hash1)` is a bit faster but still considerably slower than your uarray. – nwellnhof May 19 '16 at 22:49

2 Answers2

2

keys and each are surprisingly slow indeed. But if you store each element as a value of a hash and use values, things get a low faster. With

use strict;
use warnings;
use Benchmark qw(:all);

my $i;
my $fa;
my %hash;

my %compare = (
  uarray => sub {
    $fa->(add => $i++);
    my $memb = $fa->(members => 1);
    for my $v (@$memb) { next if !defined $v; }
  },
  hash => sub {
    $hash{ $i } = $i;
    for my $v (values %hash) {}
    $i++;
  },
);

$i = 0; $fa = uniqArrayFactory(); %hash = ();
cmpthese(10000, \%compare);

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if exists $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

I get:

         Rate   hash uarray
hash   3205/s     --    -6%
uarray 3401/s     6%     --
nwellnhof
  • 32,319
  • 7
  • 89
  • 113
  • Great to see `values()` performing much better, although performance gap is greater on `v5.20.3` (above 20%). Which perl version did you use? – mpapec May 20 '16 at 05:36
  • 1
    I was using 5.18 on OS X. With 5.22 on Ubuntu I get -13% and 15%. – nwellnhof May 20 '16 at 12:15
1

Ironically, maybe Tie::IxHash, which was motivated by the desire to retrieve the keys of a hash in a specified order, is as close as you're going to get to what you want.

In the Tie::IxHash implementation, keys and values are stored in array references. keys returns a copy of the set of keys, but something like (tied %hash)->[1] would give you direct access to it.

Removing elements in a Tie::IxHash is O(n). A possible workaround for that would be to replace values with undef rather than deleting them. That is, preferring

$ixhash{$obsolete_key} = undef;

to

delete $ixhash{$obsolete_key};

Or if you were able to pool your deletions -- if you could organize your code so that you usually called delete on several keys around the same time and in between other operations on the hash -- then there are opportunities for improving on Tie::IxHash.

mob
  • 117,087
  • 18
  • 149
  • 283