The current release of the package `gmp` does not support set operations such as `intersect`, `setdiff`, etc. I'm doing some work with number sequences (see OEIS for examples) and need to handle large collections of large integers. I'm currently stuck using various loops to generate the desired differences or intersections; while I could probably generate compiled (Rcpp, etc.) code, I'm hoping to find a way within existing R functions and packages.
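For concreteness, here is a minimal sketch of the kind of loop I mean (illustrative only; `intersect_bigz_loop` is just a placeholder name used here):

library(gmp)
# Placeholder helper: keep the elements of a that also occur somewhere in b,
# comparing the bigz values directly element by element.
intersect_bigz_loop <- function(a, b) {
  keep <- vapply(seq_along(a), function(i) any(a[i] == b), logical(1))
  a[which(keep)]
}
intersect_bigz_loop(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4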

- Could you add more detail about the objects you're working with? For example, how long are the collections, and how big are the numbers? `gmp` doesn't even have a good `sort()` function, so I think it's going to be tricky. – user2554330 Jun 01 '22 at 19:09
- A pipeline like `Rmpfr` -> `sets` -> `github EnriquePH/OEIS.R`? – Chris Jun 01 '22 at 21:42
- @user2554330 the problem is that `bigz`-class objects (as well as `bigq`) do not have a method available for the set operation functions. So I can't do, e.g., `intersect` even on `as.bigz(1:4)` and `as.bigz(3:6)`. Number sequences often grow well past max(int), so I have to use extended math. – Carl Witthoft Jun 02 '22 at 10:41
- It was the 'factorial' example in the 'Arbitrarily Accurate..' vignette that suggested 'happy with integers'. LMGTFY, which I learned from you, didn't let us down. – Chris Jun 02 '22 at 13:45
- @Chris I spoke too soon. `mpfr` doesn't support base set functions and the library `sets` doesn't handle `mpfr` objects. So I'm still stuck. – Carl Witthoft Jun 02 '22 at 18:57
- Not even, speaking of `sets`, as `csets`? I, perhaps wrongly, took those as able to consume R objects (again assuming an mpfr number) and class them `cset`; well, shoot. Am I thinking in the wrong direction, [When in doubt = Strings](https://stackoverflow.com/questions/60480632/how-can-i-sort-a-vector-of-bigz-integers-in-r)? Perhaps `mpfr` wants an `as.character`... – Chris Jun 02 '22 at 19:58
- @Chris `as.cset(bigz_thing)` looks promising. I'll report back as I get farther. – Carl Witthoft Jun 03 '22 at 10:15
- `cset` will accept `bigz` values but can't perform operations on them. I'll be contacting the maintainer about that. I tried `intersect(as.character(bigz_stuff), as.character(other_bigzstuff))`, but that turns out to be slower than just running a for-loop on the `bigz` vectors. – Carl Witthoft Jun 03 '22 at 15:29
- Did you consider going through characters for set operations? `as.bigz(intersect(as.character(as.bigz("10000000000000000000000000000000000000000")+1:4),as.character(as.bigz("10000000000000000000000000000000000000000")+3:6)))` – Waldi Jun 09 '22 at 12:41
- @Waldi I did use that approach, which does work correctly. The drawback is that it's horribly slow. If I do a while- or for-loop to compare against elements of a set one by one, it's faster than converting into and out of characters. – Carl Witthoft Jun 09 '22 at 14:47
- Are there representative OEIS sequences that would reflect your workflow/needs? – Chris Jun 09 '22 at 15:01
- @Chris well, most of them :-) if you go out enough terms. I've run a couple such as Levine (1997), for which the 15th term is 508009471379488821444261986503540 http://oeis.org/A011784 – Carl Witthoft Jun 09 '22 at 17:20
- Okay, I was snooping at A000045 and A078140, since I know the term Fibonacci. github EnriquePH/OEIS.R --no-build-vignettes generally works for getting bfiles. `max(nchar(A000045$data$A000045))` gives `[1] 418`. – Chris Jun 09 '22 at 18:07
2 Answers
Use `bignum` instead of `gmp`. Note that `intersect` returns a `character` vector, and reconverting it to `biginteger` needs time.
library(bignum)
t1 <- biginteger(1:4)
t2 <- biginteger(3:6)
intersect(t1, t2)
#[1] "3" "4"
biginteger(intersect(t1, t2))
#<biginteger[2]>
#[1] 3 4
Or store the `bigz` values in a `list` instead of a vector.
library(gmp)
intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 1 2
intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 3
#
#[[2]]
#Big Integer ('bigz') :
#[1] 4
setdiff(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 1
#
#[[2]]
#Big Integer ('bigz') :
#[1] 2
f3 <- as.list(as.bigz(c(3,5,9,6,4)))
f4 <- as.list(as.bigz(c(6,7,8,10,9)))
intersect(f3, f4)
#[[1]]
#Big Integer ('bigz') :
#[1] 9
#
#[[2]]
#Big Integer ('bigz') :
#[1] 6
Unfortunately this is much slower than converting to `character`.
For `intersect`, `duplicated` could also be used, but this too is slower than converting to `character`.
. <- c(unique(as.bigz(1:4)), unique(as.bigz(3:6)))
.[duplicated(.)]
#Big Integer ('bigz') object of length 2:
#[1] 3 4
And comparing each element with each other is also slower:
t1 <- unique(as.bigz(1:4))
t2 <- unique(as.bigz(3:6))
t1[sapply(t1, function(x) any(x == t2))]
#Big Integer ('bigz') object of length 2:
#[1] 3 4
Marginally faster in this case is:
t1 <- kit::funique(as.character(as.bigz(1:4)))
t2 <- as.character(as.bigz(3:6));
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])
#Big Integer ('bigz') object of length 2:
#[1] 3 4
And a simple Rcpp version:
library(Rcpp)
sourceCpp(code=r"(
#include <Rcpp.h>
#include <list>
#include <cstring>
using namespace Rcpp;
// [[Rcpp::export]]
RawVector fun(const RawVector &x, const RawVector &y) {
std::list<uint32_t const*> xAdr;
std::list<uint32_t const*> yAdr;
const uint32_t *px = (const uint32_t *) &x[0];
const uint32_t *py = (const uint32_t *) &y[0];
uint32_t nextNr = 1;
for(uint_fast32_t j=0; j<px[0]; ++j) {
uint32_t n = px[nextNr] + 2;
auto i = xAdr.cbegin();
for(; i != xAdr.cend(); ++i) {
if(std::memcmp(&px[nextNr], *i, n * sizeof(uint32_t)) == 0) break;
}
if(i == xAdr.cend()) xAdr.push_back(&px[nextNr]);
nextNr += n;
}
nextNr = 1;
for(uint_fast32_t j=0; j<py[0]; ++j) {
yAdr.push_back(&py[nextNr]);
nextNr += py[nextNr] + 2;;
}
uint32_t l=1;
for(auto j = xAdr.cbegin(); j != xAdr.cend(); ) {
auto i = yAdr.cbegin();
for(; i != yAdr.cend(); ++i) {
if(std::memcmp(*j, *i, (**j + 2) * sizeof(uint32_t)) == 0) {
yAdr.erase(i);
break;
}
}
if(i == yAdr.cend()) j = xAdr.erase(j);
else {l += **j + 2; ++j;}
}
RawVector res(Rcpp::no_init(l * sizeof(uint32_t)));
res.attr("class") = "bigz";
uint32_t *p = (uint32_t *) &res[0];
p[0] = xAdr.size();
nextNr = 1;
for(auto j = xAdr.cbegin(); j != xAdr.cend(); ++j) {
std::memcpy(&p[nextNr], *j, (**j + 2) * sizeof(uint32_t));
nextNr += p[nextNr] + 2;
}
return res;
}
)")
fun(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4
Benchmark
library(gmp)
x <- as.bigz("10000000000000000000000000000000000000000")+1:4
y <- as.bigz("10000000000000000000000000000000000000000")+3:6
library(bignum)
xb <- biginteger("10000000000000000000000000000000000000000")+1:4
yb <- biginteger("10000000000000000000000000000000000000000")+3:6
bench::mark(check = FALSE,
"list" = do.call(c, intersect(as.list(x), as.list(y))),
"char" = as.bigz(intersect(as.character(x), as.character(y))),
"dupli" = {. <- c(unique(x), unique(y)); .[duplicated(.)]},
"loop" = {t1 <- unique(x); t2 <- unique(y); t1[sapply(t1, function(x) any(x == t2))]},
"own" = {t1 <- as.character(x); t2 <- as.character(y);
x[!duplicated(t1) & match(t1, t2, 0L) > 0L]},
"own2" = {t1 <- unique(as.character(x)); t2 <- as.character(y);
as.bigz(t1[match(t1, t2, 0L) > 0L])},
"kitFmat" = {t1 <- kit::funique(as.character(x)); t2 <- as.character(y);
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
"collFmat" = {t1 <- collapse::funique(as.character(x)); t2 <- as.character(y);
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
"bignum" = biginteger(intersect(xb, yb)),
"rcppIdx" = fun(x, y),
)
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 list 135.89µs 247.15µs 3147. 0B 2.03 1552 1
2 char 15.4µs 20.79µs 29933. 0B 12.0 9996 4
3 dupli 54.58µs 86.69µs 7428. 280B 6.18 3604 3
4 loop 80.11µs 156.36µs 4198. 6.36KB 8.26 2032 4
5 own 15.46µs 27.09µs 25254. 0B 5.05 9998 2
6 own2 13.69µs 20.53µs 28952. 0B 8.69 9997 3
7 kitFmat 13.23µs 20.57µs 30964. 0B 6.19 9998 2
8 collFmat 16.76µs 21.76µs 26667. 2.49KB 5.33 9998 2
9 bignum 32.12µs 48.51µs 10784. 0B 8.47 5090 4
10 rcppIdx 2.35µs 3.09µs 212933. 2.49KB 0 10000 0
Here it looks like converting to `character` and back is currently the fastest way within R. Even subsetting the `bigz` vector is slower than subsetting the `character` vector and converting it back to `bigz`.
The Rcpp version is about 6-7 times faster than the fastest of the other methods.
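If one stays in plain R, the character round trip can be wrapped in a small convenience helper (a sketch only; `intersect_bigz_chr` is a made-up name, not part of `gmp`):

library(gmp)
# Sketch: set intersection of two bigz vectors via the character round trip.
intersect_bigz_chr <- function(a, b) {
  as.bigz(intersect(as.character(a), as.character(b)))
}
intersect_bigz_chr(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4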

- Also thought it would be a solution, but unfortunately slower than `as.character`, which according to the OP is already too slow: `microbenchmark::microbenchmark(intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6))),intersect(as.character(as.bigz(1:4)), as.character(as.bigz(3:6))))` – Waldi Jun 12 '22 at 06:47
- This is nicely presented! I will experiment with `bignum` to see if it can do everything I need in my algorithms (i.e. more than just set-type operations). My benchmarks basically agree with your work, in that a simple `for`-loop to test each element one-by-one remains the fastest I've found. – Carl Witthoft Jun 13 '22 at 13:23
- Hmmm, I see that `intersect(bignum_a, bignum_b)` returns the character equivalents. That's unlikely to be a fast operation. – Carl Witthoft Jun 13 '22 at 13:37
- Use of `bignum` works but, as with all other approaches so far, is far slower than a `for` loop over bigz value comparisons. I'll accept this answer & give you the bonus because of the excellent details & attempts. – Carl Witthoft Jun 14 '22 at 16:54
- I added a simple Rcpp version, for comparing the timings with the other options. – GKi Jun 15 '22 at 21:51
I've likely got this wrong, but using `mpfr` objects provides access to base R `intersect`, `union`, and `setdiff`, while a `sort(...)` needs to be wrapped inside an `mpfr(sort(...), precBits)`:
library(Rmpfr)
f3 <- mpfr(5:9, 53)
f4 <- mpfr(8:12, 53)
intersect(f3,f4)
2 'mpfr' numbers of precision 53 bits
[1] 8 9
setdiff(f3,f4)
3 'mpfr' numbers of precision 53 bits
[1] 5 6 7
f3 %in% f4
[1] FALSE FALSE FALSE TRUE TRUE
# large integers from vignette
ns <- mpfr(1:24, 120)
fact_ns <- factorial(ns)
fact_ns[20:24]
5 'mpfr' numbers of precision 120 bits
[1] 2432902008176640000 51090942171709440000 1124000727777607680000
[4] 25852016738884976640000 620448401733239439360000
pasc80 <- chooseMpfr.all(n = 80, 77)[40:49]
pasc80
10 'mpfr' numbers of precision 77 bits
[1] 107507208733336176461620 104885081691059684352800 97393290141698278327600
[4] 86068488962431036661600 72375774809317008101800 57900619847453606481440
[7] 44054819449149483192400 31869443856831541032800 21910242651571684460050
[10] 14308729894903957198400
mpfr(sort(union(fact_ns[20:24], pasc80)), 77)
15 'mpfr' numbers of precision 77 bits
[1] 2432902008176640000 51090942171709440000 1124000727777607680000
[4] 14308729894903957198400 21910242651571684460050 25852016738884976640000
[7] 31869443856831541032800 44054819449149483192400 57900619847453606481440
[10] 72375774809317008101800 86068488962431036661600 97393290141698278327600
[13] 104885081691059684352800 107507208733336176461620 6204484017332394393600
So for these operations `sets` is not necessary, assuming your workflow is amenable to `Rmpfr`-based objects.
As the problem is presented in the context of 'precision', one likely wouldn't want a function that silently promotes or demotes sets to their highest/lowest precision, but would rather be intentionally involved in that decision (though, admittedly, I looked for one; a sketch of such a helper follows the examples below).
Here, renaming your f3 & f4 below to f7 & f8:
getPrec(f7)[1]
[1] 10
getPrec(f8)[1]
[1] 20
intersect(roundMpfr(f7, 20), f8)
2 'mpfr' numbers of precision 20 bits
[1] 9 6
intersect(f7, roundMpfr(f8, 10))
2 'mpfr' numbers of precision 10 bits
[1] 9 6
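If one did want such a function, a small helper could at least make the precision choice explicit before intersecting (a sketch only; `intersect_mpfr` is a name invented here, not part of `Rmpfr`):

library(Rmpfr)
# Sketch: round both inputs to one caller-chosen precision, then intersect.
intersect_mpfr <- function(a, b, precBits = max(getPrec(a), getPrec(b))) {
  intersect(roundMpfr(a, precBits), roundMpfr(b, precBits))
}
f7 <- mpfr(c(3, 5, 9, 6, 4), 10)
f8 <- mpfr(c(6, 7, 8, 10, 9), 20)
intersect_mpfr(f7, f8)
2 'mpfr' numbers of precision 20 bits
[1] 9 6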
So it appears that some 'precision handling' is required for set operations, though those additional cycles may be avoided if, upon `mpfr` creation, the defaults render the inputs the same precision. Using OEIS as inputs:
library(OEIS.R) # git clone of EnriquePH/OEIS.R --no-build-vignettes
A011784 <- OEIS_bfile('A011784')
max(nchar(A011784$data$A011784))
[1] 221
A078140 <- OEIS_bfile('A078140')   # also fetch the second sequence
max(nchar(A078140$data$A078140))
[1] 228
# so we see precision handling here, perhaps
A011784_228 <- mpfr(A011784$data$A011784, 228)
A078140_228 <- mpfr(A078140$data$A078140, 228)
intersect(A011784_228,A078140_228)
2 'mpfr' numbers of precision 228 bits
[1] 1 3
Ah, so little in common. And it is probably not that your sequences are already in OEIS, but rather that you are checking sequences 'from the wild' for similarity to those in OEIS, so this may not reflect your workflow.
As to using lists:
A011784_bigz <- as.bigz(A011784$data$A011784)  # presumably constructed like this
is(A011784_bigz)
[1] "bigz" "oldClass" "Mnumber" "mNumber"
> is(A011784_228)
[1] "mpfr" "list" "Mnumber" "mNumber" "vector"
So those as.list cycles have already been expended in mpfr creation.
And, as some related light reading: primitive sets, from recent news.

- Unfortunately, you've been misled the same way I was. Try `f3 <- mpfr(c(3,5,9,6,4), 10)` and `f4 <- mpfr(c(6,7,8,10,9), 20)` and the results of `intersect` will be empty because of the difference in precision. So possibly if, prior to each set operation, I were to reset `Precbits` for both objects to the same value, it might work correctly, but that leads to a lot of processing overhead. – Carl Witthoft Jun 06 '22 at 10:51