
I tried the following "hello world" code, first on my system (8 cores) and then on a server (160 cores):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  double t;
  t = MPI_Wtime();
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  //printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
  printf("%f---%d---%s\n", MPI_Wtime() - t, rank, processor_name);
  usleep(500000); // ~0.5 s, to make sure each process needs a significant amount of time to finish
                  // (sleep(.5) would truncate to 0, since sleep() takes whole seconds)
  MPI_Finalize();
  return 0;
}

I ran the program with 160 processes using `mpirun -np 160 ./hello`. I expected the server run to be more efficient, as it has a core available for each process from the start, but the result was the opposite:

8 cores : 2.25 sec
160 cores : 5.65 sec

Please correct me if I am confused about how cores are assigned to processes. Also, please explain how the mapping is done by default. I know there are several ways to do it manually, either by using a rankfile or by using options related to socket/core affinity. I want to know how processes are treated in Open MPI and how they are given resources by default.
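
For reference, these are the kinds of manual placement controls the question alludes to. The exact flag spellings depend on the Open MPI version, and the host names and slot numbers below are hypothetical:

# explicit placement with a rankfile, passed via -rf/--rankfile
# (contents of "myrankfile"; node names node01/node02 are made up)
rank 0=node01 slot=0:0
rank 1=node01 slot=0:1
rank 2=node02 slot=0:0

mpirun -np 3 -rf myrankfile ./hello

# or request core binding and print the resulting bindings
mpirun -np 160 --bind-to-core --report-bindings ./hello                 # Open MPI 1.6-era syntax
mpirun -np 160 --map-by core --bind-to core --report-bindings ./hello   # Open MPI 1.7+ syntax

Without such flags, Open MPI of that era applies no binding at all; as noted in the comments below, ranks simply fill the available slots in order.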

Casey
Ankur Gautam
  • I assume it is not due to SSH lag or something like that. – Ankur Gautam Jul 09 '13 at 20:26
  • If you're not asking a question that's specifically language agnostic, don't tag both C and C++. – Wug Jul 09 '13 at 20:29
  • @Wug I will take care of that next time. I thought it was language-specific due to the Open MPI implementation. – Ankur Gautam Jul 09 '13 at 20:35
  • Which one did you compile it as? – Wug Jul 09 '13 at 20:36
  • Using `mpic++`; the file is a .cpp file. – Ankur Gautam Jul 09 '13 at 20:37
  • From the documentation about parallel programming I have read, you only get a speedup up to a certain point and then you can actually slow down because of all the overhead. –  Jul 09 '13 at 20:42
  • @Matt2234 But shouldn't it depend on the available processors/cores? It gets slowed down only because of the extra waiting time, which should have happened in the 8-core case. – Ankur Gautam Jul 09 '13 at 20:45
  • Try this again with some difficult problem to solve and you'd probably get different results. You would think 8 would be slower, but with all the overhead from 160 cores moving data around in memory you are getting slowed down. –  Jul 09 '13 at 20:47
  • What type of memory is being transferred? This is not a shared-memory model, so exactly what overhead are you talking about? – Ankur Gautam Jul 09 '13 at 20:51
  • Calling `MPI_Wtime()` before `MPI_Init()` results in a non-standard conforming MPI program. The standard allows for implementations to provide synchronised global clocks and those require that `MPI_Init()` is called first. Keep this remark in mind. – Hristo Iliev Jul 09 '13 at 22:55
  • That was tricky. I changed the `MPI_Wtime()` position (see the corrected sketch after these comments), and there was a difference in processing time in both cases. Overall it is faster on the server, but only by a slight margin. Thanks @HristoIliev for helping me time and again. Can you also please explain the default process-to-core/socket mapping in Open MPI? I mean, is a core allotted to each process, or is the picture more complex? I also want to know how it is possible to execute 160 processes, each with at least a 0.5 s sleep, on 8 processors in almost 1 second. Isn't there a waiting queue or something like that? – Ankur Gautam Jul 09 '13 at 23:14
  • And the new timings are: 8 cores: 1.0023000, 160 cores: 1.0000023. – Ankur Gautam Jul 09 '13 at 23:14
  • The default mapping is no mapping at all. Ranks are allocated one after another until all slots on the node are full, then it moves to the next machine in the host file. When the last node is full, it begins oversubscribing the nodes starting with the first one. Note that `MPI_Init()` takes progressively more time with increasing number of processes and this behaviour is not limited to Open MPI only. – Hristo Iliev Jul 09 '13 at 23:15
  • Sleeping doesn't use any CPU resources and even the OS task scheduler pays no attention to the process unless it is woken up by a signal or at the end of the sleep interval. You can have thousands of processes, all sleeping at the same time and for almost the same amount of time. – Hristo Iliev Jul 09 '13 at 23:22
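
A minimal sketch of the corrected timing described in the comments above, with `MPI_Wtime()` moved after `MPI_Init()` (illustrative only, not the exact code the asker ran):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  double t;
  MPI_Init(&argc, &argv);              // initialise MPI first
  t = MPI_Wtime();                     // only now is MPI_Wtime() guaranteed to be meaningful
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("%f---%d\n", MPI_Wtime() - t, rank);
  MPI_Finalize();
  return 0;
}

Measured this way, the result no longer includes the launch and wire-up cost that grows with the number of processes.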

1 Answer


You're not actually measuring the performance of anything that would benefit from scale here. The only thing you're measuring is the startup time. In this case, you would expect that starting more processes takes more time: you have to launch the processes, wire up the network connections, etc. Also, both your laptop and the server have one process per core, so that doesn't change from one to the other.

A better test of whether having more cores is more efficient is to do some sort of sample calculation and measure the speedup from having more cores. You could try the traditional pi calculation.
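
As an illustration only (not part of the original answer), a minimal sketch of such a test could be the classic midpoint-rule estimate of pi, with each rank summing a strided share of the intervals and `MPI_Reduce` combining the partial sums; the interval count `n` is an arbitrary choice:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size;
  const long n = 100000000;            // number of intervals (arbitrary)
  double local = 0.0, pi = 0.0, t;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  t = MPI_Wtime();
  // each rank sums its own strided share of the midpoint rule for 4/(1+x^2) on [0,1]
  for (long i = rank; i < n; i += size) {
    double x = (i + 0.5) / n;
    local += 4.0 / (1.0 + x * x);
  }
  local /= n;
  MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("pi ~= %.12f, time = %f s on %d processes\n", pi, MPI_Wtime() - t, size);
  MPI_Finalize();
  return 0;
}

Because the timer is started after `MPI_Init()` and only the compute loop and the reduction are measured, the startup cost discussed above is excluded, which is what makes the 8-core versus 160-core comparison meaningful.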

Wesley Bland
  • "A better test of whether having more cores is more efficient is to do some sort of sample calculation": don't you think making a sleep call is the same as computation for a given process? I provided it explicitly just to take into account something more than the startup time. – Ankur Gautam Jul 09 '13 at 20:47
  • A sleep call isn't computation. It's just causing the processes to do nothing. The entire reason you use parallel processing (whether MPI, threads, or something else) is that you want to take a large problem that you either couldn't solve on your local machine or that would take too long to solve, and divide it up so it can be solved simultaneously by many processes. In your application, there isn't any computation. You're only measuring the time it takes to start all of the processes. – Wesley Bland Jul 09 '13 at 20:52
  • My point is: surely each process needs at least one core to execute. So how can the processing time on my computer (8 cores) be around 2 sec? If there are 160 processes and every process needs on average 0.5 seconds before it finishes, there must be waiting time during the overall processing. – Ankur Gautam Jul 09 '13 at 20:59
  • I don't know what else I can say. Obviously you're not understanding what I'm trying to explain. What you are calculating isn't processing time. You're not doing any processing. You're just starting an application and then stopping it again. – Wesley Bland Jul 09 '13 at 21:23
  • Just for the record: a desktop CPU is likely to have a much higher IPC (instructions per clock) and higher clock speeds compared to a server. If the server is old, even more so. – pmav99 Sep 29 '21 at 10:12