I have already seen several posts on this site that discuss this issue. However, my serious codes, where the overhead of creating threads should not be a big issue, have become much slower with OpenMP. I am using a quad-core machine with gfortran 4.6.3 as my compiler. Below is an example test code.

Program test
use omp_lib
integer*8 i,j,k,l
!$omp parallel 
!$omp do
do i = 1,20000
  do j = 1, 1000
   do k = 1, 1000
       l = i
   enddo
  enddo
enddo
!$omp end do nowait
!$omp end parallel
End program test

This code takes around 80 seconds without OpenMP; with OpenMP, however, it takes around 150 seconds. I have seen the same issue with my other serious codes, whose runtime is around 5 minutes in serial mode. In those codes I take care that there are no dependencies between threads. Why, then, do these codes become slower instead of faster?

Thanks in advance.

1 Answer

You have a race condition: multiple threads write to the same shared variable l. The program is therefore invalid; l should be private. The race also leads to a slowdown, because every write invalidates the copy of that cache line held by the other cores, so the threads have to reload the memory content all the time. A similar thing happens when multiple threads write to different variables that happen to share a cache line; this is known as false sharing.
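As a minimal sketch of the fix described above, the test program can be written with l declared private, so that every thread works on its own copy. The program name `test_private` and the timing with `omp_get_wtime` are additions for illustration, not part of the original post. The sketch assumes compilation with `gfortran -fopenmp`.

```fortran
program test_private
  use omp_lib
  implicit none
  integer(kind=8) :: i, j, k, l
  double precision :: t0, t1

  t0 = omp_get_wtime()

  ! private(l) gives each thread its own copy of l, which removes the race.
  ! The index i of the work-shared loop is private automatically; in Fortran
  ! the indices j and k of the sequential inner loops are private as well.
  !$omp parallel do private(l)
  do i = 1, 20000
    do j = 1, 1000
      do k = 1, 1000
        l = i
      end do
    end do
  end do
  !$omp end parallel do

  t1 = omp_get_wtime()

  ! Timing is only meaningful without optimization: the loop does no useful
  ! work, so an optimizing compiler may remove it entirely.
  print *, 'elapsed seconds: ', t1 - t0
end program test_private
```

The value of l is undefined after the parallel region, which is fine here because it is never read. If the final value were needed, `lastprivate(l)` would be the clause to look at instead.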

You are also probably not using any compiler optimizations. Enable them with -O2, -O3, -O5 or -Ofast. You will then see that the program takes 0 seconds, because the compiler optimizes everything out.

  • Somehow, using -0fast (or other optimizations) gives many warning (error?) messages on my screen and the executable is not built. – Peaceful Dec 11 '14 at 06:36
  • Ofast was introduced in a later version of gfortran than 4.6. – Vladimir F Героям слава Dec 11 '14 at 07:01
  • Can you elaborate a bit on your sentence 'It also leads to..' with reference to the code I posted? I would like to know exactly how that slows down my code. – Peaceful Dec 11 '14 at 10:13
  • Read the Wikipedia article on false sharing and you will see the issue. The cache is invalidated and has to be re-read after every write by some other thread. But the main issue is the race condition, which makes the code completely invalid! – Vladimir F Героям слава Dec 11 '14 at 10:19
  • Let us replace l = i by something else, say print*,'Hi'. True, the program is invalid. But is that the reason the code becomes slow? I understood nothing from the false sharing article. Can you explain the exact problem to me in simpler terms? – Peaceful Dec 11 '14 at 10:30
  • Yes, that is the reason. Accessing memory is slow. Every core keeps a cache of parts of the main memory. If any thread overwrites something in a part of memory that another core has cached, that core has to reload the memory content into its cache. That takes a lot of time. You must understand this or you will not be able to write fast parallel programs. Try to understand the wiki or http://docs.oracle.com/cd/E19205-01/819-5270/aewcy/index.html. With `print Hi` it will again be a competition for a shared resource, but a different one - the shared output stream - and the result will be slow. – Vladimir F Героям слава Dec 11 '14 at 12:59
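The false sharing effect mentioned in the answer and comments can be illustrated with a small sketch. It is an illustration under stated assumptions, not a benchmark: the program name, the array `counts`, the bound `max_threads = 64` and the iteration count `n` are all hypothetical choices. In the first variant every thread updates its own array element, but neighbouring elements sit in the same cache line; in the second, each thread accumulates into a private copy through a reduction.

```fortran
program false_sharing_sketch
  use omp_lib
  implicit none
  integer, parameter :: n = 100000000
  integer, parameter :: max_threads = 64   ! hypothetical upper bound on threads
  integer(kind=8) :: counts(max_threads)   ! adjacent elements share cache lines
  integer(kind=8) :: total
  integer :: i, tid
  double precision :: t0, t1

  ! Variant 1: no race (each thread touches only counts(tid)), but the
  ! elements are adjacent in memory, so cores keep invalidating each
  ! other's cached copy of the same line.
  counts = 0
  t0 = omp_get_wtime()
  !$omp parallel private(tid) num_threads(min(max_threads, omp_get_max_threads()))
  tid = omp_get_thread_num() + 1
  !$omp do
  do i = 1, n
    counts(tid) = counts(tid) + 1
  end do
  !$omp end do
  !$omp end parallel
  t1 = omp_get_wtime()
  print *, 'adjacent elements: sum =', sum(counts), ' seconds =', t1 - t0

  ! Variant 2: each thread accumulates into its own private copy of total;
  ! the copies are combined once at the end of the loop.
  total = 0
  t0 = omp_get_wtime()
  !$omp parallel do reduction(+:total)
  do i = 1, n
    total = total + 1
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  print *, 'reduction:         sum =', total, ' seconds =', t1 - t0
end program false_sharing_sketch
```

Both variants should report the same sum, equal to n. How large the timing difference is depends on the hardware, the number of threads and the optimization level; with optimization enabled the compiler may keep the counter in a register or simplify the loops, which hides the effect.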