Cache coherence is maintained at cache-line granularity, typically 64 bytes on current CPUs.
The core doing the modification will first request ownership of the entire line, which means that all other cores must invalidate their copies (if they have any). Any other core attempting a read will then have to request that line, which results in a snoop being sent to the modifying core. From there you have two options:
Either the modifying core has completed the read-modify-write sequence, and the line sits in its cache with the latest modified data - in this case the snoop initiates a writeback, the updated line becomes available to all, and the second core may read any byte from it.
Or the modifying core acquired the line through the load, but its store has not yet made the change (which is possible since stores are usually performed much later in the pipeline, while loads are often done speculatively). In this case, the core must protect the line from being snooped away, usually by implementing an internal lock for such operations. Note that on x86, for example, most atomic read-modify-write operations require a lock prefix. Also note that a normal, non-atomic sequence of read + write would simply give up the line at that point and acquire it again for the later store, thereby losing atomicity.
Edit: following Paul's comment, it is indeed possible to design a cache system that tracks ownership at sub-line granularity. This basically means decoupling the basic block of the MESI protocol from the block size used for caching: you would need to add state bits per subset (while still using a single tag for the whole line), invalidate only the local subsets, and eventually do a merge somehow to regain the full line. However, the overhead makes this quite rare, and I'm not familiar with any commercial CPU doing all this just to avoid false sharing. Either way, since such a sub-block would probably not be byte-sized, the original question still applies for bytes within the same block.