I am running a memory access experiment that uses a 2D matrix in which each row is the size of a memory page. The experiment consists of reading every element in both row-major and column-major order, and then writing to every element in both orders as well. The matrix is declared at global scope to keep the programming simple.
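To make the layout concrete, the declaration looks roughly like this (the constants are placeholders rather than my exact values, and a 4 KiB page is assumed):

/* Sketch of the test matrix (placeholder sizes, 4 KiB page assumed):
 * each row occupies exactly one page, so row-major traversal stays
 * within a page while column-major traversal touches a new page on
 * every access. */
#define ROW_COUNT 1024
#define COL_COUNT 4096                    /* bytes per row == page size */

char testArray[ROW_COUNT][COL_COUNT];     /* global, zero-initialized */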
The point of this question is that, because the test matrix is declared statically, its elements are all zero-initialized before any of my code runs, and the results I found were quite interesting. When I do the read operations first, i.e.
rowMajor_read();
colMajor_read();
rowMajor_write();
colMajor_write();
Then my colMajor_read() pass finishes very quickly.
However, if I do the write operations before the reads:
rowMajor_write();
colMajor_write();
rowMajor_read();
colMajor_read();
Then the column-major read takes nearly an order of magnitude longer.
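For completeness, each pass is timed individually with a wrapper along these lines (a simplified sketch of the harness, not my exact measurement code):

#include <stdio.h>
#include <time.h>

/* The four access functions from the experiment (defined elsewhere). */
void rowMajor_read(void);
void colMajor_read(void);
void rowMajor_write(void);
void colMajor_write(void);

/* Time a single pass with CLOCK_MONOTONIC and print the result. */
#define TIME_PASS(call)                                           \
    do {                                                          \
        struct timespec t0, t1;                                   \
        clock_gettime(CLOCK_MONOTONIC, &t0);                      \
        call;                                                     \
        clock_gettime(CLOCK_MONOTONIC, &t1);                      \
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3                 \
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;              \
        printf(#call ": %.3f ms\n", ms);                          \
    } while (0)

int main(void)
{
    /* Read-first ordering; the write-first run swaps the two pairs. */
    TIME_PASS(rowMajor_read());
    TIME_PASS(colMajor_read());
    TIME_PASS(rowMajor_write());
    TIME_PASS(colMajor_write());
    return 0;
}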
I figured it must have something to do with how the compiler optimizes the code. Since every element of the global matrix is identically zero, did the compiler remove the read operations entirely? Or is it somehow "easier" to read a value from memory that is identically zero?
I do not pass any optimization-related compiler flags, but I did declare my functions in this manner:
inline void colMajor_read(){
    register int row, col;
    register volatile char temp __attribute__((unused));

    for(col = 0; col < COL_COUNT; col++)
        for(row = 0; row < ROW_COUNT; row++)
            temp = testArray[row][col];
}
I declared temp this way because I was running into issues where the compiler removed the temp variable from the above function entirely, since its value is never used. I think that having both volatile and __attribute__((unused)) is redundant, but I included both nonetheless. I was under the impression that no optimizations are applied to a volatile variable.
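For what it's worth, an alternative way to keep the loads from being eliminated, instead of the volatile temp, would be an empty inline-asm statement that consumes each value (just a sketch with a hypothetical name; the timings above were taken with the volatile version):

/* Sketch: keep each load alive by passing the loaded value to an empty
 * asm statement (GCC/Clang extension), so the compiler cannot discard
 * the read even without a volatile temp. Not the code I measured with. */
static inline void colMajor_read_barrier(void)
{
    int row, col;

    for (col = 0; col < COL_COUNT; col++)
        for (row = 0; row < ROW_COUNT; row++) {
            char temp = testArray[row][col];
            __asm__ __volatile__("" : : "r"(temp));
        }
}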
Any ideas?
I looked at the generated assembly, and it is identical for the colMajor_read function in both cases. The (non-inlined) assembly is here: http://pastebin.com/C8062fYB