A problem here is read-modify-write.
Say you have two separate physical cores (they can be in the same physical package). They both read the value. The first core modifies it - which is to say, increments the value, which is currently held in a register. At this point, the modification begins to propagate out to the caches - but the second core in the meantime has also modified its copy of the value and also begun to propagate its change out to cache.
You lose one of the updates.
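To make the lost update concrete, here is a minimal single-threaded sketch that replays that interleaving by hand. The names (`replay_lost_update`, `shared`, `reg_a`, `reg_b`) are made up for illustration, and the local variables only stand in for each core's register; real hardware is of course not programmed like this.

```cpp
#include <cassert>

// Hypothetical replay of the interleaving: two "cores" each do
// read, increment, write on the same value, with both reads
// happening before either write.
int replay_lost_update() {
    int shared = 0;      // the value in memory

    int reg_a = shared;  // core A reads 0 into its register
    int reg_b = shared;  // core B also reads 0

    reg_a += 1;          // core A increments its private copy
    reg_b += 1;          // core B increments its private copy

    shared = reg_a;      // core A's write lands: shared is 1
    shared = reg_b;      // core B's write lands: shared is still 1

    return shared;       // two increments were performed, one survived
}
```

Two increments happened, but the function returns 1, because each write was computed from the same stale read.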
Cache coherency protocols do not handle the case of multiple concurrent writes. There is nothing which causes one core to wait on its write because another core is also -currently writing-; because that information is simply not publicly available between the cores. They -cannot- do it.
They do handle multiple consecutive writes, e.g. after the changes have been seen on a core's external bus pins (i.e. have become public knowledge, rather than being internal to the core).
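The usual way around the read-modify-write problem is to ask for an atomic read-modify-write, so that the read, the increment and the write happen as one indivisible step. Below is a minimal sketch in C++ using `std::atomic`; the helper name `count_with_atomic` and its parameter are made up for illustration, and this is just one way to do it, not the only one.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: two threads each increment a shared counter
// `per_thread` times using an atomic read-modify-write.
long count_with_atomic(int per_thread) {
    std::atomic<long> counter{0};

    auto work = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            // fetch_add performs the read, the add and the write as a
            // single indivisible operation, so no increment is lost.
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();

    return counter.load();
}
```

With the atomic increment, `count_with_atomic(100000)` always comes back as 200000. Relaxed ordering is enough here because only the counter itself matters, not the order of any other memory accesses around it.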
Another problem is instruction re-ordering. If these threads are running on different cores, each core's instruction re-ordering will not pay attention to what the other threads are doing; only to what its own thread is doing.
Imagine one thread is going to write the value and then set a flag. Another thread will see the flag raised and then read the value. Those threads, if on separate cores, will have their instruction flow re-ordered only with regard to themselves - the cores will not consider the other threads, because they cannot - they have no knowledge of them.
So in the first thread the flag setting may be re-ordered to occur before the writing of the value - after all, for just that thread, that re-ordering is fine. Those two variables are entirely disconnected. There is no ordering dependency. The dependency exists in another thread which is on a different core and so our core can't know about it.
The other thread of course will see the flag raised and read the value, even though the write to the value actually hasn't happened yet.
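The way to tell the core (and the compiler) about that cross-thread dependency is to put ordering on the flag itself. Here is a minimal sketch in C++, assuming the flag is a `std::atomic<bool>` written with release ordering and read with acquire ordering; the function name `publish_and_read` and the value 42 are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: one thread writes a value and then raises a flag,
// the other waits for the flag and then reads the value.
int publish_and_read() {
    int value = 0;                  // ordinary, non-atomic data
    std::atomic<bool> flag{false};  // the flag carries the ordering
    int seen = -1;

    std::thread writer([&] {
        value = 42;
        // Release: the write to `value` may not be re-ordered
        // to after this store.
        flag.store(true, std::memory_order_release);
    });

    std::thread reader([&] {
        // Acquire: the read of `value` below may not be re-ordered
        // to before the load that observes the flag as raised.
        while (!flag.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = value;
    });

    writer.join();
    reader.join();
    return seen;
}
```

With the release/acquire pair in place the reader is guaranteed to see 42 once it sees the flag raised. If the flag were a plain `bool`, the program would have a data race and no such guarantee would exist.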