I've got a few thoughts, and a possibly-workable solution you can consider.
First, consider tracking individual pixel deltas and transmitting/storing just those. In a typical interactive session, only small parts of the UI change at a time; moving or resizing windows tends to be (anecdotally) less common over long sessions. Tracking deltas therefore captures simple things like typed text, cursor movement and small UI updates efficiently, without much extra work.
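As a sketch of the idea (assuming NumPy-style uint8 frames; the function names are just illustrative):

```python
import numpy as np

def frame_delta(prev, curr):
    """Return the coordinates and new values of every changed pixel.

    prev/curr are H x W x 3 uint8 frames. A minimal sketch: a real
    recorder would batch deltas into runs or tiles before storing them.
    """
    changed = np.any(prev != curr, axis=-1)          # H x W boolean mask
    ys, xs = np.nonzero(changed)                     # coordinates of changed pixels
    return np.column_stack([ys, xs]), curr[ys, xs]   # (N, 2) coords, (N, 3) values

def apply_delta(frame, coords, values):
    """Replay a stored delta onto a frame (the playback side)."""
    frame[coords[:, 0], coords[:, 1]] = values
    return frame
```

For typed text or a blinking cursor, the coordinate list stays tiny compared to the full frame, which is the whole point.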
You could also consider hooking the OS at a lower level to get e.g. a display list, or even (optimally) a list of 'damage' rectangles; Mac OS X's Quartz compositor can provide this information, for example. This quickly narrows down what to update, and in the ideal case the damage list may itself be an efficient representation of the screen.
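Once the compositor hands you damage rectangles (however your OS reports them; the `(x, y, w, h)` tuple format below is just an assumption), the capture and replay sides become very simple:

```python
import numpy as np

def capture_damage(screen, damage_rects):
    """Copy out only the damaged pixels.

    damage_rects is a list of hypothetical (x, y, w, h) tuples as
    reported by the compositor; screen is an H x W x 3 uint8 frame.
    """
    return [((x, y), screen[y:y + h, x:x + w].copy())
            for (x, y, w, h) in damage_rects]

def replay_damage(frame, patches):
    """Apply captured patches back onto a frame during playback."""
    for (x, y), patch in patches:
        h, w = patch.shape[:2]
        frame[y:y + h, x:x + w] = patch
    return frame
```

Note this stores raw pixels per rectangle; combining it with the per-pixel deltas above would shrink things further.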
If you can query the OS (window manager) for information about windows, you can store a separate stream of pixel deltas for every visible window and use a simple display-list approach to 'render' them during playback. Identifying moving windows then becomes trivial: simply diff the display lists.
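A minimal playback-side sketch, assuming each window in the display list is stored as a position plus its own delta-reconstructed pixel buffer (the dict layout is hypothetical):

```python
import numpy as np

def render_display_list(size, windows):
    """Compose windows back-to-front into a frame.

    windows is a list (back-to-front) of dicts with 'x', 'y' and
    'pixels' (that window's reconstructed H x W x 3 uint8 buffer).
    """
    h, w = size
    frame = np.zeros((h, w, 3), np.uint8)
    for win in windows:
        ph, pw = win['pixels'].shape[:2]
        x0, y0 = win['x'], win['y']
        # Clip the window to the frame (it may hang off any edge).
        x1, y1 = min(x0 + pw, w), min(y0 + ph, h)
        sx, sy = max(0, -x0), max(0, -y0)
        x0c, y0c = max(x0, 0), max(y0, 0)
        frame[y0c:y1, x0c:x1] = win['pixels'][sy:sy + (y1 - y0c),
                                              sx:sx + (x1 - x0c)]
    return frame
```

A window move is then just a change of 'x'/'y' between two display lists, with no pixel data to store at all.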
If you can query the OS's information about the cursor position, you can use the cursor movement to quickly estimate movement deltas, since cursor moves usually correlate well with object movement on screen (e.g. moving windows, icons, dragging objects, etc.). This allows you to avoid processing the image to determine movement deltas.
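A sketch of using the cursor delta as a motion hypothesis and verifying it by diffing; the region and cursor inputs are assumed to come from the steps above, and a mismatch should fall back to image-based estimation:

```python
import numpy as np

def verify_cursor_motion(prev, curr, cursor_prev, cursor_next, region):
    """Test whether the changed region simply moved by the cursor delta.

    region is a hypothetical (x, y, w, h) bounding box of the changed
    area in curr. Returns (dx, dy) if prev shifted by the cursor delta
    reproduces that region exactly, else None.
    """
    dx = cursor_next[0] - cursor_prev[0]
    dy = cursor_next[1] - cursor_prev[1]
    x, y, w, h = region
    if y - dy < 0 or x - dx < 0:
        return None  # source would be off-screen; fall back to a search
    src = prev[y - dy:y - dy + h, x - dx:x - dx + w]
    dst = curr[y:y + h, x:x + w]
    if src.shape == dst.shape and np.array_equal(src, dst):
        return (dx, dy)
    return None
```

When the hypothesis holds, you get the motion vector for the price of one region comparison instead of a search.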
On to a possible solution (or a last resort if you still can't identify the movement delta with the above): the (very common) case of a single moving rectangle can actually be handled reasonably easily. Make a mask of all the pixels that change in the frame, then identify the largest connected component in the mask. If it approximates a rectangle, you can assume it represents a moved region. Either the window moves exactly orthogonally (i.e. entirely in the x- or y-direction), in which case the total delta looks like a slightly bigger rectangle, or it moves diagonally, in which case the total delta has an 8-sided shape. Either way, you can estimate the motion vector and verify it by diffing the regions. Note that this deliberately ignores details a practical implementation would have to handle, e.g. pixels moving independently near the window, or regions that don't appear to change (such as large blocks of solid colour inside the window).
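The mask-and-verify idea can be sketched as follows. To stay self-contained, this version brute-forces candidate motion vectors over the changed region's bounding box and verifies each by diffing, rather than deriving candidates from the mask's rectangle/8-sided shape, and it skips connected-component labelling, so it handles only the single-moving-window case:

```python
import numpy as np

def changed_mask(prev, curr):
    """Boolean H x W mask of pixels that differ between two frames."""
    return np.any(prev != curr, axis=-1)

def bbox(mask):
    """Bounding box of the True pixels as (x0, y0, x1, y1), exclusive."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def estimate_motion(prev, curr, max_shift=16):
    """Find the (dx, dy) that best explains the changed region as a
    translation of prev, scored by how many changed pixels it matches."""
    mask = changed_mask(prev, curr)
    if not mask.any():
        return (0, 0)
    x0, y0, x1, y1 = bbox(mask)
    h, w = mask.shape
    best, best_score = None, -1
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            if dx == 0 and dy == 0:
                continue
            # Source region for the changed bbox, rejected if off-frame.
            sy0, sy1 = y0 - dy, y1 - dy
            sx0, sx1 = x0 - dx, x1 - dx
            if sy0 < 0 or sx0 < 0 or sy1 > h or sx1 > w:
                continue
            # Verify by diffing: count changed pixels the shift explains.
            match = np.all(curr[y0:y1, x0:x1] == prev[sy0:sy1, sx0:sx1],
                           axis=-1)
            score = (match & mask[y0:y1, x0:x1]).sum()
            if score > best_score:
                best, best_score = (dx, dy), score
    return best
```

The O(max_shift²) search is far too slow for production frame rates; in practice you'd use the delta shape (or the cursor hint above) to cut the candidates down to a handful before verifying.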
Finally, I'd look into existing literature on real-time motion estimation. A lot of work has been done in optimizing motion estimation and compensation for e.g. video encoding, so you may be able to use that work as well if you find the methods above inadequate.