Why do garbage collectors have that pause in general?
GCs work by tracing reachable heap blocks starting from a set of global roots (global variables,
thread stacks and CPU registers). GCs lie on a sliding scale from snapshot to on-the-fly.
Snapshot GCs work from a snapshot of the global roots and heap topology. On-the-fly GCs
incrementally update their interpretation of the heap as the mutators run.
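To make "tracing reachable heap blocks starting from a set of global roots" concrete, here is a toy sketch of a mark phase. All names (`HeapObject`, `markFromRoots`) are illustrative inventions for this sketch, not any real collector's API:

```java
import java.util.*;

// Toy model of a tracing GC's mark phase: walk the object graph from the
// root set, marking everything reachable. Unmarked objects are garbage.
class HeapObject {
    final List<HeapObject> fields = new ArrayList<>();
    boolean marked = false;
}

public class MarkPhase {
    static void markFromRoots(List<HeapObject> roots) {
        Deque<HeapObject> worklist = new ArrayDeque<>(roots);
        while (!worklist.isEmpty()) {
            HeapObject obj = worklist.pop();
            if (obj.marked) continue;     // already behind the wavefront
            obj.marked = true;
            worklist.addAll(obj.fields);  // trace outgoing references
        }
    }

    public static void main(String[] args) {
        HeapObject a = new HeapObject(), b = new HeapObject(), c = new HeapObject();
        a.fields.add(b);                  // a and b are reachable; c is garbage
        markFromRoots(List.of(a));
        System.out.println(a.marked + " " + b.marked + " " + c.marked); // true true false
    }
}
```

The snapshot-vs-on-the-fly distinction is about whether this walk runs against a frozen view of the roots and heap, or against a heap the mutators keep changing underneath it.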
Tracing is not "why garbage collectors have that pause in general"; there are bigger reasons for pausing, with object relocation being the dominant one.
But to the point of on-the-fly vs. snapshot tracers and their relative efficiency: on-the-fly tracing can be more efficient than snapshot-at-the-beginning tracing. The same paper you refer to in describing [VCGC] classifies Azul's previous, non-generational Pauseless collector as a precise wavefront [3] tracer:
"... Most practical collectors use conservative abstractions of the wavefront rather than the precise definition provided here. That is, the wavefront is tracked at an object granularity. However, the precise wavefront is not merely theoretical and has recently been used in the hardware-assisted collector for the Azul Java server, which has a “not marked-through” bit in every pointer [2]."
Azul's C4 shares this quality, but achieves it using a pure-software, self-healing LVB read barrier.
Wholly snapshot GCs attain high throughput because the collector runs almost entirely
independently of the mutators but have high latency because taking a snapshot incurs a pause.
Wholly on-the-fly GCs attain low latency because everything is done incrementally but low
throughput because of the fine-grained communication between the mutators and GC.
A precise wavefront tracer is as efficient (from the "not spending time on unneeded objects" point of view discussed in the paper) as a stop-the-world tracer, which by definition also has a precise wavefront. Compared to snapshot-based approaches, precise wavefront scanning doesn't slow throughput in any way, nor does it require any more communication between the collector and the mutator. It has as-good-or-better throughput, as its precision ensures that it never has to repeat any tracing work.
Why doesn't Azul's GC have that problem?
They do still have this problem, but they lessened it by implementing a read barrier in hardware.
Read barriers had been proposed before, but software read barriers degrade throughput too much
because pointer reads are much more common than writes.
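The throughput concern comes down to a simple cost model: a read barrier interposes a check on every pointer load, and loads vastly outnumber stores. A hypothetical sketch (names like `Ref` and `loadBarrier` are invented for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative software read barrier: every pointer load is routed through
// a check (here stubbed out as a counter increment standing in for a real
// phase/mark test). Because the check runs on EVERY read, a naive software
// barrier taxes the mutator's hottest path.
public class ReadBarrierSketch {
    static final AtomicLong barrierHits = new AtomicLong();

    static final class Ref<T> {
        volatile T target;
        Ref(T target) { this.target = target; }
    }

    // Every read of a Ref goes through this barrier.
    static <T> T loadBarrier(Ref<T> ref) {
        barrierHits.incrementAndGet();   // stands in for the barrier's check
        return ref.target;
    }

    public static void main(String[] args) {
        Ref<String> r = new Ref<>("payload");
        String s = null;
        for (int i = 0; i < 1000; i++) s = loadBarrier(r); // barrier cost on each read
        System.out.println(s + " " + barrierHits.get());   // payload 1000
    }
}
```

A hardware barrier (or, as discussed below, a self-healing software one) attacks exactly this per-load overhead.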
As noted, if "the problem" were inefficiency due to on-the-fly vs. snapshot behavior, then C4 doesn't have it, because it is a precise wavefront tracer. Furthermore, Azul's C4 collector doesn't need or use a hardware read barrier: it runs on vanilla x86 and Linux systems, and achieves better throughput on that hardware than snapshot-based tracing collectors do (see the throughput comparisons in [1]).
However, "the problem" in the question was stated as "why garbage collectors have that pause in general?" Wavefront preciseness (or the lack of it) has little to do with the dominant pauses in garbage collectors. Concurrent and mostly-concurrent markers (even if less efficient than C4's) do exist, but their collectors still pause. The issue is that tracing is only a part of collection: tracing only tells you what is alive and where it is. It doesn't give you any memory back, and it certainly doesn't de-fragment fragmented memory. That subject is discussed in depth in various academic papers (see the slew of references in the C4 paper [1]).
It is compaction (and the implied object relocation) that seems to be the Achilles' heel of currently shipping collectors on server JVMs, and the thing that inherently causes them to take [large] pauses. The simple act of relocating even a single object from one place to another means that you have to FIX all references pointing to that object before the program uses them. For most commercially shipping collectors, this means a stop-the-world pause that keeps the application from running while references are being fixed.
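A minimal sketch of why relocation forces that fixup, assuming a classic forwarding-pointer scheme (the class and method names here are invented for illustration): once an object moves, every slot still holding its old address must be rewritten before a mutator dereferences it, and a stop-the-world compactor does this for the whole heap while the application is paused.

```java
import java.util.*;

// Toy model of compaction's reference-fixing problem: relocate an object,
// leave a forwarding pointer behind, then rewrite every stale reference.
public class RelocationSketch {
    static final class Obj {
        String payload;
        Obj forwardedTo;              // set once the object has been relocated
        Obj(String payload) { this.payload = payload; }
    }

    // Relocate: copy the object and leave a forwarding pointer at the old address.
    static Obj relocate(Obj from) {
        Obj to = new Obj(from.payload);
        from.forwardedTo = to;
        return to;
    }

    // Stop-the-world fixup: rewrite every slot that points at a moved object.
    static void fixReferences(List<Obj[]> referenceSlots) {
        for (Obj[] slot : referenceSlots)
            if (slot[0] != null && slot[0].forwardedTo != null)
                slot[0] = slot[0].forwardedTo;
    }

    public static void main(String[] args) {
        Obj old = new Obj("data");
        Obj[] slotA = { old }, slotB = { old };  // two references to one object
        Obj moved = relocate(old);
        fixReferences(List.of(slotA, slotB));    // must finish before mutators resume
        System.out.println((slotA[0] == moved) + " " + (slotB[0] == moved)); // true true
    }
}
```

The pause exists because, in this scheme, `fixReferences` must complete over all slots before any mutator is allowed to touch them.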
C4 leverages the self-healing LVB barrier (a new type of read barrier introduced in [2] and used heavily, in software-only form, in [1]) to avoid having to fix references before the application is allowed to run. That's how it avoids the pause that other collectors end up having to take. The self-healing quality reduces the dynamic cost of the read barrier by several orders of magnitude compared to previous, non-self-healing barriers (such as the Brooks-style barrier used in other concurrent compactors in academic work, and in some real-time collectors). This dramatically lower read-barrier cost is what makes it practical for generational collection and for server-scale JVMs.
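The "self-healing" idea can be sketched as follows. This is a toy model of the mechanism described in [1] and [2], not Azul's implementation, and all names are invented: when a load observes a stale reference, the barrier not only returns the corrected (forwarded) reference, it also writes the corrected value back into the slot it was loaded from, so subsequent loads from that slot never take the slow path again.

```java
// Toy model of a self-healing LVB-style read barrier.
public class SelfHealingBarrier {
    static int slowPathTaken = 0;

    static final class Obj {
        String payload;
        Obj forwardedTo;              // non-null once the object has been relocated
        Obj(String payload) { this.payload = payload; }
    }

    // Load through the barrier: if the reference is stale, fix the loaded
    // value AND heal the slot it came from.
    static Obj lvbLoad(Obj[] slot) {
        Obj ref = slot[0];
        if (ref != null && ref.forwardedTo != null) {  // stale: slow path
            slowPathTaken++;
            ref = ref.forwardedTo;
            slot[0] = ref;            // self-heal: repair the memory location itself
        }
        return ref;
    }

    public static void main(String[] args) {
        Obj old = new Obj("data");
        Obj moved = new Obj("data");
        old.forwardedTo = moved;      // the object has been relocated
        Obj[] slot = { old };
        lvbLoad(slot);                // first load takes the slow path and heals the slot
        lvbLoad(slot);                // second load finds the slot already healed
        System.out.println(slowPathTaken + " " + (slot[0] == moved)); // 1 true
    }
}
```

Because each slot pays the slow-path cost at most once per relocation, mutators can run immediately after objects move, and the heap is repaired lazily as it is used.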
[1]: "C4: the continuously concurrent compacting collector" http://dl.acm.org/citation.cfm?id=1993491
[2]: "The Pauseless GC Algorithm" http://static.usenix.org/events/vee05/full_papers/p46-click.pdf
[3]: "Correctness-Preserving Derivation of Concurrent Garbage Collection Algorithms" www.srl.inf.ethz.ch/papers/pldi06-cgc.pdf
(Graham Thomas, EMEA Technical Manager, Azul Systems)