After many hours of debugging and analysis, I have finally managed to isolate the cause of a race condition. Solving it is another matter!
To see the race condition in action, I recorded a video some way into the debugging process. I have since furthered my understanding of the situation, so please forgive the poor commentary and the silly mechanisms implemented as part of the debugging process.
http://screencast.com/t/aTAk1NOVanjR
So, the situation: we have a double-buffered implementation of a surface (i.e. java.awt.Frame or Window) with a dedicated thread that loops continuously. On each pass it invokes the render process (which performs UI layout and renders it to the backbuffer) and then, post-render, blits the rendered area from the backbuffer to the screen.
Here's the pseudo-code version (full version line 824 of Surface.java) of the double buffered render:
public RenderedRegions render() {
    // pseudo code
    RenderedRegions r = super.render();
    if (r==null) // nothing rendered
        return
    for (region in r)
        establish max bounds
    blit(max bounds)
    return r;
}
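The "establish max bounds" step above can be sketched as a union of rectangles. This is only an illustration under assumptions: the real RenderedRegions type is not shown here, so the sketch takes a plain list of java.awt.Rectangle, and the MaxBounds class name is hypothetical.

```java
import java.awt.Rectangle;
import java.util.List;

// Hypothetical helper, not part of Surface.java.
final class MaxBounds {
    /** Returns the smallest rectangle enclosing all regions, or null if there are none. */
    static Rectangle of(List<Rectangle> regions) {
        Rectangle max = null;
        for (Rectangle r : regions) {
            if (r.isEmpty()) continue; // ignore regions with no area
            max = (max == null) ? new Rectangle(r) : max.union(r);
        }
        return max;
    }
}
```

Blitting one enclosing rectangle keeps the blit to a single drawImage() call, at the cost of copying some pixels that did not change.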
As with any AWT surface implementation, it also implements (line 507 in AWT.java - link limit :( - use Surface.java link, replace core/Surface.java with plat/AWT.java) the paint/update overrides which also blit from the backbuffer to the screen:
public void paint(Graphics gr) {
    Rectangle r = gr.getClipBounds();
    refreshFromBackbuffer(r.x - leftInset, r.y - topInset, r.width, r.height);
}
Blitting is implemented (line 371 in AWT.java) using the drawImage() function:
/** synchronized as otherwise it is possible to blit before images have been rendered to the backbuffer */
public synchronized void blit(PixelBuffer s, int sx, int sy, int dx, int dy, int dx2, int dy2) {
    discoverInsets();
    try {
        window.getGraphics().drawImage(((AWTPixelBuffer)s).i,
            dx + leftInset, dy + topInset,     // destination topleft corner
            dx2 + leftInset, dy2 + topInset,   // destination bottomright corner
            sx, sy,                            // source topleft corner
            sx + (dx2 - dx), sy + (dy2 - dy),  // source bottomright corner
            null);
    } catch (NullPointerException npe) { /* FIXME: handle this gracefully */ }
}
(Warning: this is where I start making assumptions!)
The problem here seems to be that drawImage is asynchronous, so a blit from refreshFromBackbuffer() via paint/update is called first but takes effect second.
So... blit is already synchronized. The obvious way of preventing the race condition doesn't work. :(
So far I have come up with two solutions, but neither is ideal:
1. re-blit on the next render pass
   cons: performance hit, still get a bit of flicker when encountering the race condition (valid screen -> invalid screen -> valid screen)
2. do not blit on paint/update, instead set refresh bounds and use those bounds on next render pass
   cons: get black flicker when the screen is invalidated and the main application thread is catching up
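For reference, the second approach (record refresh bounds in paint/update, blit them from the render thread) might look roughly like this. It is a sketch only: the PendingRefresh class and its method names are hypothetical and do not exist in AWT.java.

```java
import java.awt.Rectangle;

// Hypothetical holder for bounds requested by paint/update.
final class PendingRefresh {
    private Rectangle pending; // guarded by this

    /** Called from paint/update: records the clip bounds, does not blit. */
    synchronized void request(Rectangle clip) {
        if (clip == null || clip.isEmpty()) return;
        pending = (pending == null) ? new Rectangle(clip) : pending.union(clip);
    }

    /** Called from the render thread: returns the accumulated bounds and clears them. */
    synchronized Rectangle take() {
        Rectangle r = pending;
        pending = null;
        return r;
    }
}
```

One thing to check with this approach, as a guess only: in the render() pseudo code the method returns early when nothing was rendered, so any pending bounds would have to be taken and blitted before that early return or they are never drawn.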
Here (1) seems to be the lesser of two evils. Edit: and (2) doesn't work, getting blank screens... (1) works fine but is just masking the problem which is potentially still there.
What I'm hoping for, and seem unable to conjure up due to my weak understanding of synchronized and how to use it, is a locking mechanism that somehow accounts for the asynchronous nature of drawImage().
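One possible shape for such a lock, assuming the underlying issue is that the render thread writes to the backbuffer outside the monitor that blit() holds (which is not confirmed). In this sketch the same lock covers both writing the buffer and reading it for a blit. The GuardedBackbuffer class and its methods are hypothetical, and an int array stands in for the real pixel buffer.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a backbuffer shared by the render thread and paint/update.
final class GuardedBackbuffer {
    private final ReentrantLock lock = new ReentrantLock();
    private final int[] pixels;

    GuardedBackbuffer(int size) { pixels = new int[size]; }

    /** Render thread: rewrites the whole buffer while holding the lock. */
    void renderFrame(int value) {
        lock.lock();
        try {
            Arrays.fill(pixels, value);
        } finally {
            lock.unlock();
        }
    }

    /** Paint path: reads the buffer under the same lock; true if no half-written frame is seen. */
    boolean snapshotIsUniform() {
        lock.lock();
        try {
            int first = pixels[0];
            for (int p : pixels) {
                if (p != first) return false;
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

This only guarantees that a blit never reads a partially rendered backbuffer. It does nothing about ordering on the screen side if the platform really does defer the copy after drawImage() returns.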
Or perhaps use ImageObserver?
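On the ImageObserver idea: drawImage() returns a boolean, and per the java.awt.Graphics documentation it returns false when the image is not yet completely loaded, in which case the observer is notified later. A quick way to see what it reports for the backbuffer is to check that return value. The sketch below assumes the backbuffer is a BufferedImage (the actual type of AWTPixelBuffer.i is not shown here) and draws into a second in-memory image instead of a window. BlitCheck is a hypothetical name.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ImageObserver;

// Hypothetical diagnostic helper, not part of AWT.java.
final class BlitCheck {
    /** Copies src onto dst and returns what drawImage reported about completeness. */
    static boolean blit(BufferedImage src, BufferedImage dst, ImageObserver observer) {
        Graphics2D g = dst.createGraphics();
        try {
            return g.drawImage(src,
                    0, 0, src.getWidth(), src.getHeight(),  // destination corners
                    0, 0, src.getWidth(), src.getHeight(),  // source corners
                    observer);
        } finally {
            g.dispose(); // release the Graphics object once the copy is issued
        }
    }
}
```

If the real blit() always gets true back, the image itself is complete at call time and an ImageObserver would have nothing to report, so any remaining delay would be on the windowing side rather than in image loading.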
Note that due to the nature of the application (Vexi, for those interested, website is out of date and I can only use 2 hyperlinks) the render thread must be outside of paint/update - it has a single-threaded script model and the layout process (a sub-process of render) invokes script.