I am running a memory access experiment that uses a 2D matrix in which each row is the size of a memory page. The experiment consists of reading every element in both row-major and column-major order, and then writing to every element in both orders as well. The matrix is declared at global scope to keep the programming simple.
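To make the layout concrete, the declaration looks roughly like this (the constants are placeholders rather than my exact values, and a 4 KiB page is assumed):

/* Sketch of the test matrix (placeholder sizes, 4 KiB page assumed):
 * each row occupies exactly one page, so row-major traversal stays
 * within a page while column-major traversal touches a new page on
 * every access. */
#define ROW_COUNT 1024
#define COL_COUNT 4096                    /* bytes per row == page size */

char testArray[ROW_COUNT][COL_COUNT];     /* global, zero-initialized */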
The point of this question is that, because the test matrix is declared statically, its elements are all zero-initialized before any of my code runs, and the results I found were quite interesting. When I do the read operations first, i.e.
rowMajor_read();
colMajor_read();
rowMajor_write();
colMajor_write();
Then my colMajor_read() pass finishes very quickly.
However, if I do the write operations before the reads:
rowMajor_write();
colMajor_write();
rowMajor_read();
colMajor_read();
Then the column-major read takes nearly an order of magnitude longer.
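For completeness, each pass is timed individually with a wrapper along these lines (a simplified sketch of the harness, not my exact measurement code):

#include <stdio.h>
#include <time.h>

/* The four access functions from the experiment (defined elsewhere). */
void rowMajor_read(void);
void colMajor_read(void);
void rowMajor_write(void);
void colMajor_write(void);

/* Time a single pass with CLOCK_MONOTONIC and print the result. */
#define TIME_PASS(call)                                           \
    do {                                                          \
        struct timespec t0, t1;                                   \
        clock_gettime(CLOCK_MONOTONIC, &t0);                      \
        call;                                                     \
        clock_gettime(CLOCK_MONOTONIC, &t1);                      \
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3                 \
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;              \
        printf(#call ": %.3f ms\n", ms);                          \
    } while (0)

int main(void)
{
    /* Read-first ordering; the write-first run swaps the two pairs. */
    TIME_PASS(rowMajor_read());
    TIME_PASS(colMajor_read());
    TIME_PASS(rowMajor_write());
    TIME_PASS(colMajor_write());
    return 0;
}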
I figured it must have something to do with how the compiler optimizes the code. Since every element of the global matrix is identically zero, did the compiler remove the read operations entirely? Or is it somehow "easier" to read a value from memory that is identically zero?
I do not pass any optimization-related compiler flags, but I did declare my functions in this manner:
inline void colMajor_read(){
    register int row, col;
    register volatile char temp __attribute__((unused));

    for(col = 0; col < COL_COUNT; col++)
        for(row = 0; row < ROW_COUNT; row++)
            temp = testArray[row][col];
}
I declared temp this way because I was running into issues where the compiler removed the temp variable from the above function entirely, since its value is never used. I think that having both volatile and __attribute__((unused)) is redundant, but I included both nonetheless. I was under the impression that no optimizations are applied to a volatile variable.
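For what it's worth, an alternative way to keep the loads from being eliminated, instead of the volatile temp, would be an empty inline-asm statement that consumes each value (just a sketch with a hypothetical name; the timings above were taken with the volatile version):

/* Sketch: keep each load alive by passing the loaded value to an empty
 * asm statement (GCC/Clang extension), so the compiler cannot discard
 * the read even without a volatile temp. Not the code I measured with. */
static inline void colMajor_read_barrier(void)
{
    int row, col;

    for (col = 0; col < COL_COUNT; col++)
        for (row = 0; row < ROW_COUNT; row++) {
            char temp = testArray[row][col];
            __asm__ __volatile__("" : : "r"(temp));
        }
}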
Any ideas?
I looked at the generated assembly, and it is identical for the colMajor_read function in both cases. The (non-inlined) assembly is here: http://pastebin.com/C8062fYB