How do I represent sparse arrays in Pari/GP?

Question

I have a function that returns integer values to integer input. The output values are relatively sparse; the function only returns around 2^14 unique outputs for input values 1....2^16. I want to create a dataset that lets me quickly find the inputs that produce any given output.

At present, I'm storing my dataset in a Map of Lists, with each output value serving as the key for a List of input values. This seems slow and appears to use a whole lot of stack space. Is there a more efficient way to create/store/access my dataset?

Added: It turns out the time taken by my sparsearray() function depends heavily on the ratio of output values (i.e., keys) to input values (values stored in the lists). Here's the time taken for a function that requires many lists, each with only a few values:

? sparsearray(2^16,x->x\7);
time = 126 ms.

Here's the time taken for a function that requires only a few lists, each with many values:

? sparsearray(2^12,x->x%7);
time = 218 ms.
? sparsearray(2^13,x->x%7);
time = 892 ms.
? sparsearray(2^14,x->x%7);
time = 3,609 ms.

As you can see, the time roughly quadruples each time the input size doubles, so it grows quadratically rather than linearly!

Here's my code:

\\ sparsearray takes two arguments, an integer "n"  and a closure "myfun", 
\\ and returns a Map() in which each key is a number, and each key is associated 
\\ with a List() of the input numbers for which the closure produces that output. 
\\ E.g.:
\\ ? sparsearray(10,x->x%3)
\\ %1 = Map([0, List([3, 6, 9]); 1, List([1, 4, 7, 10]); 2, List([2, 5, 8])])
sparsearray(n,myfun=(x)->x)=
{
    my(m=Map(),output,oldvalue=List());
    for(loop=1,n,
        output=myfun(loop);                      
        if(!mapisdefined(m,output), 
        /* then */
            oldvalue=List(),
        /* else */    
            oldvalue=mapget(m,output));
        listput(oldvalue,loop);
        mapput(m,output,oldvalue));
    m
}

Andrew · Accepted Answer · 2018-05-08T20:21:11.950

To some extent, the behavior you are seeing is to be expected. PARI appears to pass lists and maps by value rather than by reference, except to the special built-in functions for manipulating them. This can be seen by creating a wrapper function like mylistput(list,item)=listput(list,item);. When you try to use this function you will discover that it doesn't work, because it is operating on a copy of the list. Arguably, this is a bug in PARI, but perhaps they have their reasons. The upshot is that each time you add an element to one of the lists stored in the map, the entire list is copied, possibly twice. Building a list of k elements this way therefore costs on the order of k^2 operations, which matches the timings you see when a few lists hold many values.
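
The wrapper experiment described above can be sketched as follows. The name mylistput comes from the text; L is a hypothetical test list, and the comments restate what the explanation predicts rather than output checked on every GP version.

```gp
\\ wrapper that simply forwards to the built-in listput
mylistput(list, item) = listput(list, item);

L = List();
mylistput(L, 1);   \\ acts on a copy of L that is local to the wrapper
listput(L, 2);     \\ the built-in acts on L itself
#L                 \\ per the explanation above, 1: only the direct listput took effect
```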

The following is a solution that avoids this issue.

sparsearray(n,myfun=(x)->x)=
{
   my(vi=vector(n, i, i)); \\ input values
   my(vo=vector(n, i, myfun(vi[i]))); \\ output values
   my(perm=vecsort(vo,,1)); \\ obtain order of output values as a permutation
   my(list=List(), bucket=List(), key);
   for(loop=1, #perm, 
      if(loop==1||vo[perm[loop]]<>key, 
          \\ the key changed: store the finished bucket, then start a new one
          if(#bucket, listput(list,[key,Vec(bucket)]);bucket=List());
          key=vo[perm[loop]]);
      listput(bucket,vi[perm[loop]])
   );

   if(#bucket, listput(list,[key,Vec(bucket)])); 
   Mat(Col(list))
}

The output is a matrix in the same format as a map. If you would rather have a map, it can be converted with Map(...), but you probably want a matrix for processing, since there is no built-in function on a map to get the list of keys.
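
As a small illustration of working with that two-column layout, the sketch below uses a hand-written matrix m (a stand-in for the function's return value, matching the x->x%3 example from the question) and shows both reading the keys from the matrix and converting it to a map for lookups. The variable names are illustrative only.

```gp
\\ literal matrix in the same [key, values] layout as the result above
m = [0, [3, 6, 9]; 1, [1, 4, 7, 10]; 2, [2, 5, 8]];

keys = m[,1]~;        \\ first column holds the keys: [0, 1, 2]
M = Map(m);           \\ convert the two-column matrix to a map
mapget(M, 1)          \\ [1, 4, 7, 10]
mapisdefined(M, 5)    \\ 0: no input produced the output 5
```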

I reworked the above a little to make something more akin to GroupBy in C#, a function that could be useful for many things.

VecGroupBy(v, f)={
   my(g=vector(#v, i, f(v[i]))); \\ groups
   my(perm=vecsort(g,,1)); 
   my(list=List(), bucket=List(), key);
   for(loop=1, #perm, 
      if(loop==1||g[perm[loop]]<>key, 
          \\ the key changed: store the finished bucket, then start a new one
          if(#bucket, listput(list,[key,Vec(bucket)]);bucket=List());
          key=g[perm[loop]]);
      listput(bucket, v[perm[loop]])
   );
   if(#bucket, listput(list,[key,Vec(bucket)])); 
   Mat(Col(list))
}

You would use this like VecGroupBy([1..300],i->i%7).

score 1 · Answer 2 · answered Aug 24 '21 at 10:55

There is no good native GP solution, because passing arguments by reference has to be restricted in GP's memory model, owing to the way garbage collection works. (From version 2.13 on, it is supported for function arguments using the ~ modifier, but not for map components.)
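
For reference, a minimal sketch of the ~ modifier mentioned above, assuming GP 2.13 or later. The wrapper name mylistput and the list L are hypothetical; the point is only that the caller's list is modified in place rather than copied.

```gp
\\ requires GP 2.13 or later: ~ marks an argument passed by reference
mylistput(~list, item) = listput(~list, item);

L = List();
mylistput(~L, 5);   \\ the caller must also write ~ at the call site
L                   \\ List([5])
```

This helps with lists held in variables, but as noted it does not extend to lists stored as values inside a Map.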

Here is a solution using the libpari function vec_equiv(), which returns the equivalence classes of identical objects in a vector.

install(vec_equiv,G);
sparsearray(n, f=x->x)=
{
  my(v = vector(n, x, f(x)), e  = vec_equiv(v));
  [vector(#e, i, v[e[i][1]]), e];
}

? sparsearray(10, x->x%3)
%1 = [[0, 1, 2], [Vecsmall([3, 6, 9]), Vecsmall([1, 4, 7, 10]), Vecsmall([2, 5, 8])]]

(The first component holds the 3 distinct output values; the second holds the 3 corresponding sets of input indices.)

The behaviour is linear, as expected:

 ? sparsearray(2^20,x->x%7);
 time = 307 ms.
 ? sparsearray(2^21,x->x%7);
 time = 670 ms.
 ? sparsearray(2^22,x->x%7);
 time = 1,353 ms.

I wasn't aware of the new ability to pass arguments by reference, or of the very useful low-level vector manipulation functions in libpari. I commend them to other programmers: the functions vec_append and vec_prepend are clearer than using reassignments; and vec_equiv would be clearer than, e.g., vecsort(v,,8). — Joe, Aug 26 '21 at 01:39

Andrew · Answer 3 · 2018-05-07T01:23:20.043

Use mapput, mapget and mapisdefined methods on a map created with Map(). If multiple dimensions are required, then use a polynomial or vector key.

I guess that is what you are already doing, and I'm not sure there is a better way. Do you have some code? From personal experience, 2^16 values with 2^14 keys should not be an issue with regard to speed or memory; there may be some unnecessary copying going on in your implementation.
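
A minimal sketch of the vector-key idea for a two-dimensional sparse array. The coordinates and stored values are made up for illustration; the calls use the plain mapput(M, ...) form from the question (newer GP versions document it as mapput(~M, ...)).

```gp
M = Map();
mapput(M, [3, 7], 42);      \\ store an entry at (row 3, column 7)
mapput(M, [100, 2], -1);    \\ entries can be arbitrarily far apart
mapisdefined(M, [3, 7])     \\ 1
mapget(M, [3, 7])           \\ 42
mapisdefined(M, [7, 3])     \\ 0: [7, 3] is a different key from [3, 7]
```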
