
I'm using arbitrary-precision integers to represent dense bit-vectors, ranging in size from a dozen up to a few thousand bits.

My code frequently needs to check if certain bits are set (or not), so I did a few micro-benchmarks to see if some variations were significantly faster than others:

bench_1(0, _, _) :- !.
bench_1(N, V, P) :- V /\ (1 << P) =\= 0, N0 is N-1, bench_1(N0, V, P).

bench_2(0, _, _) :- !.
bench_2(N, V, P) :- (V >> P) /\ 1 =:= 1, N0 is N-1, bench_2(N0, V, P).

bench_3(0, _, _) :- !.
bench_3(N, V, P) :- (V >> P) /\ 1 =\= 0, N0 is N-1, bench_3(N0, V, P).

bench_4(0, _, _) :- !.
bench_4(N, V, P) :- (V >> P) /\ 1  >  0, N0 is N-1, bench_4(N0, V, P).

bench_5(0, _, _) :- !.
bench_5(N, V, P) :- 1 is (V >> P) /\  1, N0 is N-1, bench_5(N0, V, P).

With both SWI and SICStus, the above variants are all (almost) equally fast.

Then I stumbled upon the following interesting part of the SWI-Prolog manual:

getbit(+IntExprV, +IntExprI)

Evaluates to the bit value (0 or 1) of the IntExprI-th bit of IntExprV. Both arguments must evaluate to non-negative integers. The result is equivalent to (IntExprV >> IntExprI)/\1, but more efficient because materialization of the shifted value is avoided.

Future versions will optimise (IntExprV >> IntExprI)/\1 to a call to getbit/2, providing both portability and performance.
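As a quick sanity check (SWI-Prolog only; check_getbit/1 is an ad-hoc helper name I made up, not part of any library), the claimed equivalence can be spot-checked like this:

```prolog
% SWI-Prolog only: spot-check that getbit/2 agrees with the
% shift-and-mask formulation for the first few hundred bit positions.
check_getbit(V) :-
    forall(between(0, 300, P),
           (  B1 is getbit(V, P),
              B2 is (V >> P) /\ 1,
              B1 =:= B2
           )).
```

For example, `?- V is 1<<1000-1, check_getbit(V).` should succeed.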

So I checked out getbit/2:

bench_6(0, _, _) :- !.
bench_6(N, V, P) :- getbit(V,P) =:= 1, N0 is N-1, bench_6(N0, V, P).

I used the following code for micro-benchmarking:

call_indi_delta(G, What, Delta) :-
   statistics(What, [V0|_]),
   call(G),
   statistics(What, [V1|_]),
   Delta is V1 - V0.

run(Ind, Reps, Expr, Pos) :-
   Position is Pos,
   Value    is Expr,
   member(P_3, [bench_1,bench_2,bench_3,bench_4,bench_5,bench_6]),
   G =.. [P_3,Reps,Value,Position],
   call_indi_delta(G, Ind, T_ms), 
   write(P_3:Reps=T_ms), nl,
   false.

With run(runtime, 10000000, 1<<1000-1, 200) I observed these runtimes:

        | SWI    | SWI -O | SICStus | SICStus |
        | 7.3.23 | 7.3.23 |   4.3.2 |   4.3.3 |
--------+--------+--------+---------+---------+
bench_1 | 4547ms | 3704ms |   900ms |   780ms |
bench_2 | 4562ms | 3619ms |   970ms |   850ms |
bench_3 | 4541ms | 3603ms |   970ms |   870ms |
bench_4 | 4541ms | 3633ms |   940ms |   890ms |
bench_5 | 4502ms | 3632ms |   950ms |   840ms |
--------+--------+--------+---------+---------+
bench_6 | 1424ms |  797ms |    n.a. |    n.a. |

It appears that getbit/2 gives SWI a 3-4× speedup over the equivalent shift-and-mask formulations.

Is there some better formulation (arith. fun., etc.) to get a similar speedup with SICStus?

Thank you in advance!

repeat
  • SWI-Prolog [uses `mpz_tstbit`](https://github.com/SWI-Prolog/swipl-devel/blob/198feb84f2218c90f1d7d98eca6f1fe60375e3ff/src/pl-arith.c#L2680) from [GMP](https://gmplib.org/manual/Integer-Logic-and-Bit-Fiddling.html) (look towards the bottom). Do you know how SICStus implements multiple-precision integers? – Boris Jul 07 '16 at 08:34
  • @Boris. I know GMP, but I don't know how SWI integrates it. (I'm still lost in the SWI sources.) Thank you for the links! About SICStus: I do not know that, but I have a hunch that the [tag:ffi] incurs some substantial calling overhead, but I'm not sure... I'll try the C interface to check that out. – repeat Jul 07 '16 at 10:35
  • SWI uses GMP for integers that get too big. If you grep for getbit in src/*.c you can find the line of code I linked. – Boris Jul 07 '16 at 11:27
  • @Boris. I got that. But how does it handle memory allocation? GMP allows registering custom allocators https://gmplib.org/manual/Custom-Allocation.html but with an important caveat: "*GMP may use allocated blocks to hold pointers to other allocated blocks. This will limit the assumptions a conservative garbage collection scheme can make.*" – repeat Jul 07 '16 at 11:42

1 Answer


No, I do not think that there are faster formulations than the ones you tried. In particular, there is nothing like getbit/2 in SICStus (it is not even used internally when compiling arithmetic).

PS. I would use walltime, in general, for benchmarking. Current OSes do not provide a very reliable runtime.

PPS. I would add a benchmark that uses a dummy version of the tested code sequence, just to ensure that the tested code actually costs much more than the benchmarking harness. (I did, and replacing the bit test with a call to a dummy/3 that does nothing makes it much faster, so the benchmark seems OK.)
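Such a baseline might look like this (a minimal sketch mirroring the bench_N loops from the question, with the bit test removed; dummy/3 is the name used above):

```prolog
% Dummy baseline: the same counting loop as bench_1..bench_6, but
% without any bit test, so its runtime approximates the overhead of
% the benchmarking harness itself.
dummy(0, _, _) :- !.
dummy(N, V, P) :- N0 is N-1, dummy(N0, V, P).
```

Plugging dummy into the member/2 list of run/4 then shows how much of each measured time is loop overhead rather than bit testing.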

Per Mildner
  • Thx! Should I try using the [tag:ffi]? IIRC the calling overhead appeared a bit on the high side with the JIT in SICStus 4.3.2 (64bit), has that changed? – repeat Jul 07 '16 at 10:36
  • I tried `walltime`, too. I expected to see noticable GC-overhead with big bit-vectors (IIRC GC-time is part of `walltime`). I was surprised that I didn't. – repeat Jul 07 '16 at 10:41
  • @repeat My guess is that fli would have too high overhead (and accessing the contents of the big integer from C is also not free), but it is hard to know without trying. – Per Mildner Jul 08 '16 at 06:38