Data structure for unique element storage

Question

I'm looking for a data structure which should preferably perform equal O(1)? for any number of elements when adding/removing/retrieving elements.

Here are some additional guidelines,

retrieving elements should not involve slow keys()
elements must be always unique and defined
element order is not significant
addition or removal of element should not involve iteration over other elements
gaps in retrieved list of elements are tolerable and can be represented with undef value

Please suggest better solution than,

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

UPDATE (usage)

my $fa = uniqArrayFactory();

$fa->(add => 10);
$fa->(del => 10);
my $members = $fa->(mebers => 1);

Can you add some examples of how you would call that please? — simbabque, May 19 '16 at 12:10
@Sobrique yes I am, but currently not happy with `keys` performance, and it would get invoked a lot. — mpapec, May 19 '16 at 12:47
I'm sure you won't be able to write something in Perl that is faster than the built-in `keys` operator, which is written in C. I think you need to explain more about what you're doing that is giving you unacceptable performance. `keys` is a trivial operation and does little more than return a pre-existing list of C strings reformatted as Perl scalars — Borodin, May 19 '16 at 13:05
@Borodin returning the ready array ref `$members` is faster as `return [keys %$seen]`. So, if the factory must be very fast for the `$factory->(members => 1)` then the `keys` isn't the fastest solution. — clt60, May 19 '16 at 14:33
keys() is very far from O(1), not to mention what jm666 wrote. I see this as a worthy memory/speed tradeoff in case when elements are often being retrieved. — mpapec, May 19 '16 at 15:51
At some point, you will iterate the member array which is O(n) anyway. So I don't understand why you insist on returning the member array in O(1). — nwellnhof, May 19 '16 at 19:06
@nwellnhof that is a valid point so I did some benchmarks which include member iteration, https://gist.github.com/mpapec/29198302b7cca8d3c74be45ff4b12230 (`keys()` and a few flavors of `each()`) and differences are significant. — mpapec, May 19 '16 at 20:28
@Сухой27 Interesting, I didn't realize that `keys` and `each` are so slow. `for my $v (keys %hash1)` is a bit faster but still considerably slower than your uarray. — nwellnhof, May 19 '16 at 22:49

score 2 · Accepted Answer · answered May 19 '16 at 23:28

keys and each are surprisingly slow indeed. But if you store each element as a value of a hash and use values, things get a low faster. With

use strict;
use warnings;
use Benchmark qw(:all);

my $i;
my $fa;
my %hash;

my %compare = (
  uarray => sub {
    $fa->(add => $i++);
    my $memb = $fa->(members => 1);
    for my $v (@$memb) { next if !defined $v; }
  },
  hash => sub {
    $hash{ $i } = $i;
    for my $v (values %hash) {}
    $i++;
  },
);

$i = 0; $fa = uniqArrayFactory(); %hash = ();
cmpthese(10000, \%compare);

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if exists $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

I get:

         Rate   hash uarray
hash   3205/s     --    -6%
uarray 3401/s     6%     --

Great to see `values()` performing much better, although performance gap is greater on `v5.20.3` (above 20%). Which perl version did you use? — mpapec, May 20 '16 at 05:36
I was using 5.18 on OS X. With 5.22 on Ubuntu I get -13% and 15%. — nwellnhof, May 20 '16 at 12:15

score 1 · Answer 2 · answered May 19 '16 at 18:27

Ironically, maybe Tie::IxHash, which was motivated by the desire to retrieve the keys of a hash in a specified order, is as close as you're going to get to what you want.

In the Tie::IxHash implementation, keys and values are stored in array references. keys returns a copy of the set of keys, but something like (tied %hash)->[1] would give you direct access to it.

Removing elements in a Tie::IxHash is O(n). A possible workaround for that would be to replace values with undef rather than deleting them. That is, preferring

$ixhash{$obsolete_key} = undef;

to

delete $ixhash{$obsolete_key};

Or if you were able to pool your deletions -- if you could organize your code so that you usually called delete on several keys around the same time and in between other operations on the hash -- then there are opportunities for improving on Tie::IxHash.

Data structure for unique element storage

2 Answers2