"The H.264 Advanced Video Compression Standard" by Iain Richardson is the standard reference book. For full details, see the specification itself (ITU-T Recommendation H.264).
Each pixel is produced by combining a prediction with a residual.
In an intra frame, the prediction for a square block of pixels is made by copying the pixels to the left of or above that block. (Bits in the bitstream specify which pixels to copy, and in some modes the prediction is formed from a filtered version of those pixels instead of a straight copy.)
For the very first block in an image, there are no previously decoded pixels, so the prediction is set to the value 128 (mid-grey for 8-bit video).
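As a rough illustration of the prediction step, here is a minimal sketch of three 4x4 prediction modes for 8-bit samples. The function name, the mode strings and the argument layout are made up for this example; the real standard defines more modes and has extra rules about which neighbours are available.

```python
# Simplified sketch of three 4x4 intra prediction modes (8-bit samples).
# predict_4x4 and its mode names are hypothetical, not taken from the standard.

def predict_4x4(mode, above=None, left=None):
    """Return a 4x4 prediction block as a list of rows.

    above: the 4 decoded pixels directly above the block, or None if unavailable
    left:  the 4 decoded pixels directly left of the block, or None if unavailable
    """
    if mode == "vertical":
        # Copy the row above into every row of the block.
        return [list(above) for _ in range(4)]
    if mode == "horizontal":
        # Copy each left-hand pixel across its row.
        return [[left[y]] * 4 for y in range(4)]
    if mode == "dc":
        neighbours = []
        if above is not None:
            neighbours += list(above)
        if left is not None:
            neighbours += list(left)
        if not neighbours:
            # No decoded neighbours (e.g. the very first block): use mid-grey.
            dc = 128
        else:
            # Rounded mean of the available neighbours.
            dc = (sum(neighbours) + len(neighbours) // 2) // len(neighbours)
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"unknown mode: {mode}")
```

Calling `predict_4x4("dc")` with no neighbours gives a block filled with 128, which is the first-block case described above.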
Once you have a prediction, a value (called the residual) is added to it to form the final value for the pixel (assuming deblocking is turned off). The residual is carried in the bitstream in transformed and quantised form, because the transform means fewer bits are needed to encode it.
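The reconstruction step can be sketched as below. This assumes 8-bit samples and a residual that has already been dequantised and inverse-transformed back into pixel differences; `reconstruct` is a hypothetical helper name and the residual values are invented for the example.

```python
def reconstruct(prediction, residual):
    """Add the residual to the prediction and clip to the 8-bit range 0..255."""
    return [
        [max(0, min(255, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]

prediction = [[128] * 4 for _ in range(4)]   # e.g. the first block of a picture
residual = [[3, -2, 0, 1]] * 4               # hypothetical decoded residual
block = reconstruct(prediction, residual)
print(block[0])  # [131, 126, 128, 129]
```

The clipping matters only when the sum falls outside the valid sample range; for a good prediction the residual is small and the sum stays well inside it.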
So, in summary, the bitstream first specifies a number that says which method to use to copy or filter previously decoded pixels to form a prediction, followed by a set of numbers that specify what values to add to this prediction to get the final pixels.
The aim is that the prediction is very close to the actual image so few bits need to be spent on the residual.
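To make that aim concrete, here is a toy sketch of how an encoder might compare candidate predictions by the size of the residual each would leave. The sum of absolute differences (SAD) is used here as a simple stand-in for bit cost; that is an assumption of this example, since real encoders typically use more elaborate cost measures. The helper names and pixel values are hypothetical.

```python
def sad(actual, prediction):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - p)
        for actual_row, pred_row in zip(actual, prediction)
        for a, p in zip(actual_row, pred_row)
    )

def best_mode(actual, candidates):
    """Return the name of the candidate prediction with the smallest SAD."""
    return min(candidates, key=lambda name: sad(actual, candidates[name]))

actual = [[100, 101, 102, 103]] * 4           # the block being encoded
candidates = {                                # hypothetical predictions
    "vertical": [[100, 100, 102, 104]] * 4,
    "dc": [[128] * 4] * 4,
}
print(best_mode(actual, candidates))  # vertical
```

Here the vertical prediction leaves residuals of at most 1 per pixel, while the flat 128 prediction leaves residuals of about 25 per pixel, so the former would need far fewer bits.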