The current release of the package `gmp` does not support set operations such as `intersect`, `setdiff`, etc. I'm doing some work with number sequences (see OEIS for examples) and need to handle large collections of large integers. I'm currently stuck using various loops to generate the desired differences or intersections; while I could probably generate compiled (Rcpp, etc.) code, I'm hoping to find a way within existing R functions and packages.
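For concreteness, here is a minimal sketch of the kind of loop I mean (illustrative only; `intersect_bigz_loop` is just a placeholder name used here):

library(gmp)
# Placeholder helper: keep the elements of a that also occur somewhere in b,
# comparing the bigz values directly element by element.
intersect_bigz_loop <- function(a, b) {
  keep <- vapply(seq_along(a), function(i) any(a[i] == b), logical(1))
  a[which(keep)]
}
intersect_bigz_loop(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4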

- Could you add more detail about the objects you're working with? For example, how long are the collections, and how big are the numbers? `gmp` doesn't even have a good `sort()` function, so I think it's going to be tricky. – user2554330 Jun 01 '22 at 19:09
- A pipeline like `Rmpfr` -> `sets` -> `github EnriquePH/OEIS.R`? – Chris Jun 01 '22 at 21:42
- @user2554330 the problem is that `bigz`-class objects (as well as `bigq`) do not have a method available for the set operation functions. So I can't do, e.g., `intersect` even on `as.bigz(1:4)` and `as.bigz(3:6)`. Number sequences often grow well past max(int), so I have to use extended math. – Carl Witthoft Jun 02 '22 at 10:41
- It was the 'factorial' example in the 'Arbitrarily Accurate..' vignette that suggested 'happy with integers'. LMGTFY, which I learned from you, didn't let us down. – Chris Jun 02 '22 at 13:45
- @Chris I spoke too soon. `mpfr` doesn't support base set functions and the library `sets` doesn't handle `mpfr` objects. So I'm still stuck. – Carl Witthoft Jun 02 '22 at 18:57
- Not even, speaking of `sets`, as `csets`? I, perhaps wrongly, took those as able to consume R objects (again assuming an mpfr number) and class them `cset`; well, shoot. Am I thinking in the wrong direction, [When in doubt = Strings](https://stackoverflow.com/questions/60480632/how-can-i-sort-a-vector-of-bigz-integers-in-r)? Perhaps `mpfr` wants an `as.character`... – Chris Jun 02 '22 at 19:58
- @Chris `as.cset(bigz_thing)` looks promising. I'll report back as I get farther. – Carl Witthoft Jun 03 '22 at 10:15
- `cset` will accept `bigz` values but can't perform operations on them. I'll be contacting the maintainer about that. I tried `intersect(as.character(bigz_stuff), as.character(other_bigzstuff))`, but that turns out to be slower than just running a for-loop on the `bigz` vectors. – Carl Witthoft Jun 03 '22 at 15:29
- Did you consider going through characters for set operations? `as.bigz(intersect(as.character(as.bigz("10000000000000000000000000000000000000000")+1:4),as.character(as.bigz("10000000000000000000000000000000000000000")+3:6)))` – Waldi Jun 09 '22 at 12:41
- @Waldi I did use that approach, which does work correctly. The drawback is that it's horribly slow. If I do a while- or for-loop to compare against elements of a set one by one, it's faster than converting into and out of characters. – Carl Witthoft Jun 09 '22 at 14:47
- Are there representative OEIS sequences that would reflect your workflow/needs? – Chris Jun 09 '22 at 15:01
- @Chris well, most of them :-) if you go out enough terms. I've run a couple such as Levine (1997), for which the 15th term is 508009471379488821444261986503540 http://oeis.org/A011784 – Carl Witthoft Jun 09 '22 at 17:20
- Okay, I was snooping at A000045 and A078140, since I know the term Fibonacci. github EnriquePH/OEIS.R --no-build-vignettes generally works for getting bfiles. `max(nchar(A000045$data$A000045))` gives `[1] 418`. – Chris Jun 09 '22 at 18:07
2 Answers
Use `bignum` instead of `gmp`. Note that `intersect` returns a `character` vector, and reconverting it to `biginteger` needs time.
library(bignum)
t1 <- biginteger(1:4)
t2 <- biginteger(3:6)
intersect(t1, t2)
#[1] "3" "4"
biginteger(intersect(t1, t2))
#<biginteger[2]>
#[1] 3 4
Or store the `bigz` values in a `list` instead of a vector.
library(gmp)
intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 1 2
intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 3
#
#[[2]]
#Big Integer ('bigz') :
#[1] 4
setdiff(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 1
#
#[[2]]
#Big Integer ('bigz') :
#[1] 2
f3 <- as.list(as.bigz(c(3,5,9,6,4)))
f4 <- as.list(as.bigz(c(6,7,8,10,9)))
intersect(f3, f4)
#[[1]]
#Big Integer ('bigz') :
#[1] 9
#
#[[2]]
#Big Integer ('bigz') :
#[1] 6
Unfortunately this is much slower than converting to `character`.
For `intersect`, `duplicated` could also be used, but this too is slower than converting to `character`.
. <- c(unique(as.bigz(1:4)), unique(as.bigz(3:6)))
.[duplicated(.)]
#Big Integer ('bigz') object of length 2:
#[1] 3 4
And comparing each element with each other is also slower:
t1 <- unique(as.bigz(1:4))
t2 <- unique(as.bigz(3:6))
t1[sapply(t1, function(x) any(x == t2))]
#Big Integer ('bigz') object of length 2:
#[1] 3 4
Marginally faster in this case is:
t1 <- kit::funique(as.character(as.bigz(1:4)))
t2 <- as.character(as.bigz(3:6));
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])
#Big Integer ('bigz') object of length 2:
#[1] 3 4
And a simple Rcpp version:
library(Rcpp)
sourceCpp(code=r"(
#include <Rcpp.h>
#include <list>
#include <cstring>
using namespace Rcpp;
// [[Rcpp::export]]
RawVector fun(const RawVector &x, const RawVector &y) {
std::list<uint32_t const*> xAdr;
std::list<uint32_t const*> yAdr;
const uint32_t *px = (const uint32_t *) &x[0];
const uint32_t *py = (const uint32_t *) &y[0];
uint32_t nextNr = 1;
for(uint_fast32_t j=0; j<px[0]; ++j) {
uint32_t n = px[nextNr] + 2;
auto i = xAdr.cbegin();
for(; i != xAdr.cend(); ++i) {
if(std::memcmp(&px[nextNr], *i, n * sizeof(uint32_t)) == 0) break;
}
if(i == xAdr.cend()) xAdr.push_back(&px[nextNr]);
nextNr += n;
}
nextNr = 1;
for(uint_fast32_t j=0; j<py[0]; ++j) {
yAdr.push_back(&py[nextNr]);
nextNr += py[nextNr] + 2;;
}
uint32_t l=1;
for(auto j = xAdr.cbegin(); j != xAdr.cend(); ) {
auto i = yAdr.cbegin();
for(; i != yAdr.cend(); ++i) {
if(std::memcmp(*j, *i, (**j + 2) * sizeof(uint32_t)) == 0) {
yAdr.erase(i);
break;
}
}
if(i == yAdr.cend()) j = xAdr.erase(j);
else {l += **j + 2; ++j;}
}
RawVector res(Rcpp::no_init(l * sizeof(uint32_t)));
res.attr("class") = "bigz";
uint32_t *p = (uint32_t *) &res[0];
p[0] = xAdr.size();
nextNr = 1;
for(auto j = xAdr.cbegin(); j != xAdr.cend(); ++j) {
std::memcpy(&p[nextNr], *j, (**j + 2) * sizeof(uint32_t));
nextNr += p[nextNr] + 2;
}
return res;
}
)")
fun(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4
Benchmark
library(gmp)
x <- as.bigz("10000000000000000000000000000000000000000")+1:4
y <- as.bigz("10000000000000000000000000000000000000000")+3:6
library(bignum)
xb <- biginteger("10000000000000000000000000000000000000000")+1:4
yb <- biginteger("10000000000000000000000000000000000000000")+3:6
bench::mark(check = FALSE,
"list" = do.call(c, intersect(as.list(x), as.list(y))),
"char" = as.bigz(intersect(as.character(x), as.character(y))),
"dupli" = {. <- c(unique(x), unique(y)); .[duplicated(.)]},
"loop" = {t1 <- unique(x); t2 <- unique(y); t1[sapply(t1, function(x) any(x == t2))]},
"own" = {t1 <- as.character(x); t2 <- as.character(y);
x[!duplicated(t1) & match(t1, t2, 0L) > 0L]},
"own2" = {t1 <- unique(as.character(x)); t2 <- as.character(y);
as.bigz(t1[match(t1, t2, 0L) > 0L])},
"kitFmat" = {t1 <- kit::funique(as.character(x)); t2 <- as.character(y);
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
"collFmat" = {t1 <- collapse::funique(as.character(x)); t2 <- as.character(y);
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
"bignum" = biginteger(intersect(xb, yb)),
"rcppIdx" = fun(x, y),
)
Result
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
1 list 135.89µs 247.15µs 3147. 0B 2.03 1552 1
2 char 15.4µs 20.79µs 29933. 0B 12.0 9996 4
3 dupli 54.58µs 86.69µs 7428. 280B 6.18 3604 3
4 loop 80.11µs 156.36µs 4198. 6.36KB 8.26 2032 4
5 own 15.46µs 27.09µs 25254. 0B 5.05 9998 2
6 own2 13.69µs 20.53µs 28952. 0B 8.69 9997 3
7 kitFmat 13.23µs 20.57µs 30964. 0B 6.19 9998 2
8 collFmat 16.76µs 21.76µs 26667. 2.49KB 5.33 9998 2
9 bignum 32.12µs 48.51µs 10784. 0B 8.47 5090 4
10 rcppIdx 2.35µs 3.09µs 212933. 2.49KB 0 10000 0
Here it looks like converting to `character` and back is currently the fastest way within R. Even subsetting the `bigz` vector is slower than subsetting the `character` vector and converting it back to `bigz`.
The Rcpp version is about 6-7 times faster than the fastest of the other methods.
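If one stays in plain R, the character round trip can be wrapped in a small convenience helper (a sketch only; `intersect_bigz_chr` is a made-up name, not part of `gmp`):

library(gmp)
# Sketch: set intersection of two bigz vectors via the character round trip.
intersect_bigz_chr <- function(a, b) {
  as.bigz(intersect(as.character(a), as.character(b)))
}
intersect_bigz_chr(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4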

- Also thought it would be a solution, but unfortunately slower than `as.character`, which according to the OP is already too slow: `microbenchmark::microbenchmark(intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6))),intersect(as.character(as.bigz(1:4)), as.character(as.bigz(3:6))))` – Waldi Jun 12 '22 at 06:47
- This is nicely presented! I will experiment with `bignum` to see if it can do everything I need in my algorithms (i.e. more than just set-type operations). My benchmarks basically agree with your work, in that a simple `for`-loop to test each element one-by-one remains the fastest I've found. – Carl Witthoft Jun 13 '22 at 13:23
- Hmmm, I see that `intersect(bignum_a, bignum_b)` returns the character equivalents. That's unlikely to be a fast operation. – Carl Witthoft Jun 13 '22 at 13:37
- Use of `bignum` works but, as with all other approaches so far, is far slower than a `for` loop over bigz value comparisons. I'll accept this answer & give you the bonus because of the excellent details & attempts. – Carl Witthoft Jun 14 '22 at 16:54
- I added a simple Rcpp version, for comparing the timings with the other options. – GKi Jun 15 '22 at 21:51
I've likely got this wrong, but using `mpfr` objects provides access to base R `intersect`, `union`, and `setdiff`, while a `sort(...)` needs to be wrapped inside an `mpfr(sort(...), precBits)`:
library(Rmpfr)
f3 <- mpfr(5:9, 53)
f4 <- mpfr(8:12, 53)
intersect(f3,f4)
2 'mpfr' numbers of precision 53 bits
[1] 8 9
setdiff(f3,f4)
3 'mpfr' numbers of precision 53 bits
[1] 5 6 7
f3 %in% f4
[1] FALSE FALSE FALSE TRUE TRUE
# large integers from vignette
ns <- mpfr(1:24, 120)
fact_ns <- factorial(ns)
fact_ns[20:24]
5 'mpfr' numbers of precision 120 bits
[1] 2432902008176640000 51090942171709440000 1124000727777607680000
[4] 25852016738884976640000 620448401733239439360000
pasc80 <- chooseMpfr.all(n = 80, 77)[40:49]
pasc80
10 'mpfr' numbers of precision 77 bits
[1] 107507208733336176461620 104885081691059684352800 97393290141698278327600
[4] 86068488962431036661600 72375774809317008101800 57900619847453606481440
[7] 44054819449149483192400 31869443856831541032800 21910242651571684460050
[10] 14308729894903957198400
mpfr(sort(union(fact_ns[20:24], pasc80)), 77)
15 'mpfr' numbers of precision 77 bits
[1] 2432902008176640000 51090942171709440000 1124000727777607680000
[4] 14308729894903957198400 21910242651571684460050 25852016738884976640000
[7] 31869443856831541032800 44054819449149483192400 57900619847453606481440
[10] 72375774809317008101800 86068488962431036661600 97393290141698278327600
[13] 104885081691059684352800 107507208733336176461620 6204484017332394393600
So for these operations `sets` is not necessary, assuming your workflow is amenable to `Rmpfr`-based objects.
As the problem is presented in the context of 'precision', one likely wouldn't want a function that silently promotes or demotes sets to their highest/lowest precision, but would rather be intentionally involved in that decision (though, admittedly, I looked for one; a sketch of such a helper follows the examples below).
Here, renaming your f3 & f4 below to f7 & f8:
getPrec(f7)[1]
[1] 10
getPrec(f8)[1]
[1] 20
intersect(roundMpfr(f7, 20), f8)
2 'mpfr' numbers of precision 20 bits
[1] 9 6
intersect(f7, roundMpfr(f8, 10))
2 'mpfr' numbers of precision 10 bits
[1] 9 6
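If one did want such a function, a small helper could at least make the precision choice explicit before intersecting (a sketch only; `intersect_mpfr` is a name invented here, not part of `Rmpfr`):

library(Rmpfr)
# Sketch: round both inputs to one caller-chosen precision, then intersect.
intersect_mpfr <- function(a, b, precBits = max(getPrec(a), getPrec(b))) {
  intersect(roundMpfr(a, precBits), roundMpfr(b, precBits))
}
f7 <- mpfr(c(3, 5, 9, 6, 4), 10)
f8 <- mpfr(c(6, 7, 8, 10, 9), 20)
intersect_mpfr(f7, f8)
2 'mpfr' numbers of precision 20 bits
[1] 9 6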
So it appears that some 'precision handling' is required for set operations, though those additional cycles may be avoided if, upon `mpfr` creation, the defaults render the inputs the same precision. Using OEIS as inputs:
library(OEIS.R) # git clone of EnriquePH/OEIS.R --no-build-vignettes
A011784 <- OEIS_bfile('A011784')
max(nchar(A011784$data$A011784))
[1] 221
A078140 <- OEIS_bfile('A078140')   # also fetch the second sequence
max(nchar(A078140$data$A078140))
[1] 228
# so we see precision handling here, perhaps
A011784_228 <- mpfr(A011784$data$A011784, 228)
A078140_228 <- mpfr(A078140$data$A078140, 228)
intersect(A011784_228,A078140_228)
2 'mpfr' numbers of precision 228 bits
[1] 1 3
Ah, so little in common. And it is probably not that your sequences are already in OEIS, but rather that you are checking sequences 'from the wild' for similarity to those in OEIS, so this may not reflect your workflow.
As to using lists:
A011784_bigz <- as.bigz(A011784$data$A011784)  # presumably constructed like this
is(A011784_bigz)
[1] "bigz" "oldClass" "Mnumber" "mNumber"
> is(A011784_228)
[1] "mpfr" "list" "Mnumber" "mNumber" "vector"
So those as.list cycles have already been expended in mpfr creation.
And, as some related light reading: primitive sets, from recent news.

- Unfortunately, you've been misled the same way I was. Try `f3 <- mpfr(c(3,5,9,6,4), 10)` and `f4 <- mpfr(c(6,7,8,10,9), 20)` and the results of `intersect` will be empty because of the difference in precision. So possibly if, prior to each set operation, I were to reset `Precbits` for both objects to the same value, it might work correctly, but that leads to a lot of processing overhead. – Carl Witthoft Jun 06 '22 at 10:51