
I have tried a few versions of a simple predicate which pulls random values out of the extra-logical universe and puts them into a list. I assumed that the version with the accumulator would be tail-call optimized, as nothing happens after the recursive call, so a path to optimization exists, but it is not (it runs out of the "global stack"). On the other hand, the "naive version" has apparently been optimized into a loop. This is SWI-Prolog.

Why would the accumulator version be impervious to tail-call optimization?

Here are the predicate versions.

Slowest, runs out of local stack space (expectedly)

Here, we just use a head with no function symbols, to make things explicit: the list construction happens in the body.

% Slowest, and uses 4 inferences per call (+ 1 at the end of recursion). 
% Uses "local stack" indicated in the "Stack limit (1.0Gb) exceeded" 
% error at "Stack depth: 10,321,204":
% "Stack sizes: local: 1.0Gb, global: 7Kb, trail: 1Kb"

oracle_rands_explicit(Out,Size) :- 
   Size>0, !, 
   NewSize is Size-1, 
   oracle_rands_explicit(R,NewSize), 
   X is random_float, 
   Out = [X-Size|R].
oracle_rands_explicit([],0).
?- oracle_rands_explicit(X,4).
X = [0.7717053554954681-4, 0.9110187097066331-3, 0.9500246711335888-2, 0.25987829195170065-1].

?- X = 1000000, time(oracle_rands_explicit(_,X)).
% 4,000,001 inferences, 1.430 CPU in 1.459 seconds (98% CPU, 2797573 Lips)

?- X = 50000000, time(oracle_rands_explicit(_,X)).
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 1.0Gb, global: 7Kb, trail: 1Kb
ERROR:   Stack depth: 10,321,204, last-call: 0%, Choice points: 6
ERROR:   Possible non-terminating recursion: ...

Faster, and does not run out of stack space

Again, we use a head with no function symbols to make things explicit, but we move the recursive call to the end of the body, which apparently makes a difference!

% Same number of inferences as Slowest, i.e. 4 inferences per call
% (+ 1 at the end of recursion), but at HALF the time.
% Does not run out of stack space! Conclusion: this is tail-call-optimized.

oracle_rands_explicit_last_call(Out,Size) :- 
   Size>0, !, 
   NewSize is Size-1,    
   X is random_float, 
   Out = [X-Size|R],
   oracle_rands_explicit_last_call(R,NewSize).
oracle_rands_explicit_last_call([],0).
?- oracle_rands_explicit_last_call(X,4).
X = [0.6450176209046125-4, 0.5605468429780708-3, 0.597052872950385-2, 0.14440970112076815-1].    

?- X = 1000000, time(oracle_rands_explicit_last_call(_,X)).
% 4,000,001 inferences, 0.697 CPU in 0.702 seconds (99% CPU, 5739758 Lips)

?- X = 50000000, time(oracle_rands_explicit_last_call(_,X)).
% 200,000,001 inferences, 32.259 CPU in 32.464 seconds (99% CPU, 6199905 Lips)

Compact, fewer inferences, and does not run out of stack space

Here we allow function symbols in the head for more compact notation. Still naive recursion.

% Only 3 inferences per call (+ 1 at the end of recursion), but approx.
% the same time as "Faster".
% Does not run out of stack space! Conclusion: this is tail-call-optimized.

oracle_rands_compact([X-Size|R],Size) :- 
   Size>0, !, 
   NewSize is Size-1,    
   X is random_float, 
   oracle_rands_compact(R,NewSize).
oracle_rands_compact([],0).
?- oracle_rands_compact(X,4).
X = [0.815764980826608-4, 0.6516093608470418-3, 0.03206964297092248-2, 0.376168614426895-1].

?- X = 1000000, time(oracle_rands_compact(_,X)).
% 3,000,001 inferences, 0.641 CPU in 0.650 seconds (99% CPU, 4678064 Lips)

?- X = 50000000, time(oracle_rands_compact(_,X)).
% 150,000,001 inferences, 29.526 CPU in 29.709 seconds (99% CPU, 5080312 Lips)

Accumulator-based and unexpectedly runs out of (global) stack space

% Accumulator-based, 3 inferences per call (+ 1 at the end of recursion + 1 at ignition),
% plus roughly one inference per element for the final reverse/2 (hence the
% ~4,000,004 total below), but it is often faster than the compact version.
% Uses "global stack" as indicated in the "Stack limit (1.0Gb) exceeded" 
% error at "Stack depth: 12,779,585":
% "Stack sizes: local: 1Kb, global: 0.9Gb, trail: 40.6Mb"

oracle_rands_acc(Out,Size) :- oracle_rands_acc(Size,[],Out).

oracle_rands_acc(Size,ThreadIn,ThreadOut) :- 
   Size>0, !, 
   NewSize is Size-1, 
   X is random_float, 
   oracle_rands_acc(NewSize,[X-Size|ThreadIn],ThreadOut).
oracle_rands_acc(0,ThreadIn,ThreadOut) :-
   reverse(ThreadIn,ThreadOut).
?- oracle_rands_acc(X,4).
X = [0.7768407880604368-4, 0.03425412654687081-3, 0.6392634169514991-2, 0.8340458397587001-1].

?- X = 1000000, time(oracle_rands_acc(_,X)).
% 4,000,004 inferences, 0.798 CPU in 0.810 seconds (99% CPU, 5009599 Lips)

?- X = 50000000, time(oracle_rands_acc(_,X)).
ERROR: Stack limit (1.0Gb) exceeded
ERROR:   Stack sizes: local: 1Kb, global: 0.9Gb, trail: 40.6Mb
ERROR:   Stack depth: 12,779,585, last-call: 100%, Choice points: 6
ERROR:   In:
ERROR:     [12,779,585] user:oracle_rands_acc(37220431, [length:12,779,569], _876)

Addendum: another variant of the "compact" version.

Here we move the Size parameter to the first position and do not use !. But indexing is a complex matter; differences will probably only become noticeable with many more clauses.

oracle_rands_compact2(Size,[X-Size|R]) :- 
   Size>0, 
   NewSize is Size-1,    
   X is random_float, 
   oracle_rands_compact2(NewSize,R).
oracle_rands_compact2(0,[]).

Trying with L instead of an anonymous variable, and with L used after the call.

?- X = 10000000, time(oracle_rands_compact2(X,L)),L=[].
% 30,000,002 inferences, 6.129 CPU in 6.159 seconds (100% CPU, 4894674 Lips)

?- X = 10000000, time(oracle_rands_compact(L,X)),L=[].
% 30,000,001 inferences, 5.865 CPU in 5.892 seconds (100% CPU, 5115153 Lips)

Maybe marginally faster. The above numbers vary a bit; one would really have to gather proper statistics over a hundred runs or so (a sketch of such a harness follows).
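
For example, a minimal harness sketch (bench_mean/3 is just an illustrative name here, assuming SWI's statistics/2, between/3 and sum_list/2):

% Illustrative only: run Goal N times, timing each run via statistics(cputime,_),
% then report the mean CPU time in seconds. once/1 keeps each run deterministic.
bench_mean(Goal, N, Mean) :-
   findall(T,
           ( between(1, N, _),
             statistics(cputime, T0),
             once(Goal),
             statistics(cputime, T1),
             T is T1 - T0 ),
           Times),
   sum_list(Times, Sum),
   length(Times, Len),
   Mean is Sum/Len.

?- bench_mean(oracle_rands_compact2(1000000, _), 100, MeanSeconds).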

Does re-introducing the cut make it faster (it doesn't seem to make it slower)?

oracle_rands_compact3(Size,[X-Size|R]) :- 
   Size>0, !,
   NewSize is Size-1,    
   X is random_float, 
   oracle_rands_compact3(NewSize,R).
oracle_rands_compact3(0,[]).
?- X = 10000000, time(oracle_rands_compact3(X,L)),L=[].
% 30,000,001 inferences, 5.026 CPU in 5.061 seconds (99% CPU, 5969441 Lips)

Can't say, really.

    Concerning `_compact*`: It does not make sense to compare them for the entire list being present. Instead, only compare them w.r.t. your original query. Otherwise GC dominates. – false Jan 16 '20 at 13:20

1 Answer


It all depends on the top-level shell and the actual interpretation of the _. Try

 ?- X = 50000000, time(oracle_rands_compact(L,X)),L=[].

instead; it will be more or less as bad as the accumulator version, which has to produce the entire list first only to hand it over to reverse/2. To see this, use:

?- set_prolog_flag(trace_gc, true).
true.

?- X = 50000000, time(oracle_rands_compact(_,X)).
% GC: gained 0+0 in 0.001 sec; used 440+8; free 126,520+129,008
% GC: gained 0+0 in 0.000 sec; used 464+16; free 126,496+129,000
% GC: gained 0+0 in 0.000 sec; used 464+16; free 126,496+129,000
...

?- X = 50000000, time(oracle_rands_compact(L,X)),L=[].
% SHIFT: l:g:t = 0:1:0 ...l+g+t = 131072+262144+131072 (0.000 sec)
% GC: gained 0+0 in 0.002 sec; used 123,024+16; free 135,008+129,000
% SHIFT: l:g:t = 0:1:0 ...l+g+t = 131072+524288+131072 (0.000 sec)
% GC: gained 0+0 in 0.003 sec; used 257,976+24; free 262,200+128,992
% SHIFT: l:g:t = 0:0:1 ...l+g+t = 131072+524288+262144 (0.000 sec)
% SHIFT: l:g:t = 0:1:0 ...l+g+t = 131072+1048576+262144 (0.000 sec)
% GC: gained 0+0 in 0.007 sec; used 520,104+16; free 524,360+260,072
...

While we are at it: your _compact version can be accelerated by exchanging the arguments and removing the cut. Classic first-argument indexing is capable of handling this situation, avoiding any choice point. (SWI has WAM-style first-argument indexing plus a lesser version for multiple arguments, last time I checked.)
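
For concreteness, a sketch of that reshaped predicate (the name oracle_rands_indexed is made up here; it is essentially oracle_rands_compact2 from the question's addendum): the counter comes first and there is no cut, so for Size > 0 only the recursive clause is a candidate under first-argument indexing and no choice point is left on the recursive path.

% Counter first, no cut: the 0 case and the Size>0 case are distinguished
% by the first argument, so indexing selects the right clause directly.
oracle_rands_indexed(Size, [X-Size|R]) :-
   Size > 0,
   NewSize is Size-1,
   X is random_float,
   oracle_rands_indexed(NewSize, R).
oracle_rands_indexed(0, []).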

  • One does not even need to do use-after-call with `L=[]`. Just changing `_` to `L` causes the *compact* version to run out of stack space, at the same point. The SWI [indexing scheme seems complex](https://www.swi-prolog.org/pldoc/man?section=jitindex), apparently inspired by YAP. Prolog is compact in writing, but the compactness is an indicator that massive work on massive data structures is going on on the call stack. – David Tonhofer Jan 16 '20 at 09:31
  • `L = []` prevents the term from being printed; it is quite large. And with a better GC, the list `L` **could** be reduced to `[collected|collected]`. – false Jan 16 '20 at 13:33