
I have a problem printing a sparse matrix in a C++/MPI program that I hope you can help me solve.

Problem: I need to print a sparse matrix as a list of triples (x, y, v_xy) to a .txt file in a program that has been parallelized with MPI. Since I am new to MPI, I decided not to deal with the parallel I/O routines provided by the library and instead let the master processor (rank 0 in my case) print the output. However, the time for printing the matrix increases when I increase the number of processors:

  • 1 processor: 11.7 seconds
  • 2 processors: 26.4 seconds
  • 4 processors: 25.4 seconds

I have already verified that the output is exactly the same in the three cases. Here is the relevant section of the code:

if (rank == 0)
{    
    sw.start();

    std::ofstream ofs_output(output_file);
    targets.print(ofs_output);
    ofs_output.close();

    sw.stop();
    time_output = sw.get_duration();
    std::cout << time_output << std::endl;
}

My stopwatch sw measures wall-clock time using the gettimeofday function; a minimal sketch of it follows the print method below. The print method for the targets matrix is the following:

void sparse_matrix::print(std::ofstream &ofs)
{
    // Write one "row,col,value" line per non-zero entry.
    for (const_iterator iter_row = _matrix.begin(); iter_row != _matrix.end(); ++iter_row)
    {
        const int temp_row = (*iter_row).get_key();
        for (value_type::const_iterator iter_col = (*iter_row).get_value().begin();
             iter_col != (*iter_row).get_value().end(); ++iter_col)
        {
            ofs << temp_row << "," << (*iter_col).get_key() << "," << (*iter_col).get_value() << std::endl;
        }
    }
}
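
For reference, here is a minimal sketch of such a stopwatch, assuming a plain start/stop/get_duration wrapper around gettimeofday matching how sw is used above (my real class may differ in details):

#include <sys/time.h>

// Minimal stopwatch sketch: wall-clock timing via gettimeofday,
// exposing the start/stop/get_duration interface used above.
class stopwatch
{
public:
    void start() { gettimeofday(&_t0, 0); }
    void stop()  { gettimeofday(&_t1, 0); }

    // Elapsed wall-clock time in seconds between start() and stop().
    double get_duration() const
    {
        return double(_t1.tv_sec - _t0.tv_sec)
             + double(_t1.tv_usec - _t0.tv_usec) * 1e-6;
    }

private:
    timeval _t0, _t1;
};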

I do not understand what is causing the slow-down, since only processor 0 does the output and it is the very last operation of the program: all the other processors are already done while processor 0 prints the output. Do you have any idea?

  • Are these actually three different machines or are you testing that code by pretending to have three processors on one? – stefan Feb 07 '15 at 11:44
  • @stefan: I am using an i7 quad-core processor (Dell XPS 15). I forgot to mention that I am executing the code on an Oracle Linux VirtualBox VM to which I allocated 4 processors in the settings. I can't figure out the dependence of the print execution time on the number of processors, since only processor 0 executes the print instruction. – Pierpaolo Necchi Feb 07 '15 at 13:00
  • On your VirtualBox VM, switching to more than one processor will create an overhead which slows down your system. The conditions of your measurement experiment are hence not representative! By the way, gettimeofday() is not the best function to measure performance (see http://linux.die.net/man/2/gettimeofday under the heading "Notes"; a sketch using MPI_Wtime instead follows these comments). – Christophe Feb 07 '15 at 15:24
  • @Christophe: Thank you for the useful reference. I will keep it in mind. Concerning the slow-down, is the overhead present only when I execute my code using `mpirun -np k` with `k >= 2`? In that case, do you think that allocating more RAM to the VM would improve output performance? I am currently using 4 GB on the VM out of the 16 GB available on my laptop. Thank you for your help – Pierpaolo Necchi Feb 07 '15 at 18:24
  • @PierpaoloNecchi Yes, because when k >= 2 the real processor starts to be shared and used by more processes. Thus a lot more context switches, but potentially also more latency for process synchronisation. I don't think that memory allocation would help, although I don't know the design of your VM. – Christophe Feb 07 '15 at 19:56
  • @Christophe: Thank you for the explanation: I was mixing up cores and processors. If I understood correctly, my computer has only one processor with 4 cores. So, when I launch 4 processes with MPI, each process will run on one of the cores. Therefore, I need to pay an overhead for letting my processor manage its 4 cores, hence the slow-down when I print the output. Is this correct? Still, a slowdown of x2.5 seems huge... – Pierpaolo Necchi Feb 07 '15 at 23:19
  • Yes, this slow-down looks terrible. When you use more cores, each core is a little bit slower, but the overall throughput of the processor is higher. The problem with virtual processors is that each environment is more than just a thread running on a core. Have you tried running your MPI program without the VM? Normally, MPI should be able to take advantage of native cores without overhead. Look also here: http://stackoverflow.com/questions/5797615/mpi-cores-or-processors – Christophe Feb 07 '15 at 23:27
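
Following the timing discussion in the comments, here is a minimal sketch of measuring the same output section with MPI_Wtime instead of gettimeofday; it reuses the rank, output_file, and targets names from the question:

#include <mpi.h>
#include <fstream>
#include <iostream>

// ... after MPI_Init, with rank, output_file and targets set up as in the question:
if (rank == 0)
{
    double t0 = MPI_Wtime();               // wall-clock time in seconds

    std::ofstream ofs_output(output_file);
    targets.print(ofs_output);
    ofs_output.close();

    double time_output = MPI_Wtime() - t0; // elapsed seconds spent printing
    std::cout << time_output << std::endl;
}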

1 Answer


Well, I finally understood what was causing the problem. Running my MPI-parallelized program on a Linux virtual machine drastically increased the time needed to print a large amount of data to a .txt file as the number of cores used grew. The problem is caused by the virtual machine, which does not behave well when running MPI. I tested the same program on a physical 8-core machine, and there the time for printing the output does not increase with the number of cores used.
