C and MPI: function works differently with same data

Question

I have successfully written a complicated function with the PETSc library (an MPI-based scientific library for solving huge linear systems in parallel). This library provides its own version of "malloc" and its own basic datatypes (e.g. "PetscInt" in place of the standard "int"). For this function, I have always used the PETSc facilities instead of standard ones such as "malloc" and "int". The function has been extensively tested and has always worked fine. Despite the use of MPI, the function is fully serial, and all processors run it on the same data (each processor has its own copy): no communication is involved at all.

Then I decided to drop PETSc and write a standard MPI version. Basically, I rewrote all the code, substituting PETSc constructs with classic C ones, not by brute force but paying attention to each substitution (no editor "Replace" tool, I mean; everything was done by hand). During the substitution a few minor changes were made, such as declaring two separate variables a and b instead of an array a[2]. These are the substitutions:

PetscMalloc -> malloc

PetscScalar -> double

PetscInt -> int

PetscBool -> an enum I created to replicate it, as C doesn't have a boolean datatype.

Basically, the algorithms were not changed during the substitution process. The main function is a "for" loop (actually four nested loops). At each iteration, it calls another function. Let's call it Disfunction. Disfunction works perfectly outside the four nested loops (I tested it separately), but inside them it works in some cases and fails in others. I also checked the data passed to Disfunction at each iteration: with EXACTLY the same input, Disfunction performs different computations from one iteration to another. The results also don't look like the product of undefined behaviour, as Disfunction always returns the same results across different runs of the program. I have noticed that changing the number of processors passed to "mpiexec" changes the computed results.

That's my problem. A few other considerations: the program uses "malloc" extensively; the computed data is the same on all processes, whether correct or not; Valgrind doesn't detect errors (apart from reporting an error on a normal use of printf, which is a separate problem and off-topic here); Disfunction recursively calls two other functions (also extensively tested in the PETSc version); the algorithms involved are mathematically correct; Disfunction depends on an integer parameter p>0: for p=1,2,3,4,5 it works PERFECTLY, for p>=6 it does not.

If asked, I can post the code, but it's long and complicated (scientifically, not from a programming standpoint) and I think it would take time to explain.

My guess is that I am messing up the memory allocations, but I can't figure out where. Sorry for my English and for the bad formatting.

Did you try running valgrind or any similar tool on it? It is possible that you had a bug that didn't show up with a different allocation algorithm... – Antzi, Nov 25 '13 at 02:50
Well, Valgrind doesn't seem to detect any out-of-bounds access. Also, the algorithm is the same, I just replaced "PetscMalloc" with the standard C "malloc". – user3029623, Nov 25 '13 at 03:29
The algorithm I use is numerically stable (as the scientific article that proposes it says). The other algorithms in the code work in integer arithmetic. Also, there's no parallelism in that particular part of the code. I use MPI communication in other parts. It doesn't seem to be non-deterministic: the results are always the same. They change only with a different number of processors. Could it be a problem of bad allocation? Like allocating a float array as integer? – user3029623, Nov 25 '13 at 18:11
Well, I don't know if anyone is still interested, but the problem was that the PETSc function PetscMalloc zero-initializes the data, unlike the standard C malloc. Stupid mistake... – user3029623, Nov 25 '13 at 23:32

score 1 · Answer 1 · answered Oct 31 '14 at 13:20


Well, I don't know if anyone is still interested, but the problem was that the PETSc function PetscMalloc zero-initializes the data, unlike the standard C malloc. Stupid mistake... – user3029623
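A minimal sketch of the kind of fix this implies, assuming the ported code relied on freshly allocated arrays starting at zero as reported above. The names `alloc_zeroed` and `accumulate` are hypothetical and not part of PETSc or of the original program; the point is only that `calloc` (or `malloc` followed by `memset`) gives zeroed memory, while plain `malloc` leaves the contents indeterminate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not a PETSc call): allocate n doubles that all
 * start at 0.0. calloc returns all-bits-zero memory, which represents
 * 0.0 on IEEE 754 platforms (assumed here). Returns NULL on failure. */
double *alloc_zeroed(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(double));
}

/* Hypothetical routine that accumulates into its output, and therefore
 * silently relies on out[] starting at zero. With plain malloc, out[]
 * would hold indeterminate values and the sums would be wrong. */
void accumulate(double *out, const double *in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] += in[i];
}
```

Leftover heap contents depend on earlier allocations and frees, which would be consistent with results that repeat from run to run yet change with the loop iteration, the problem size, or the number of processes.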


Armali


score 0 · Answer 2 · answered Nov 25 '13 at 02:52


The only suggestion I can offer without reference to the code itself is to try to construct progressively simpler test cases that demonstrate your issue.

When you narrow down the iterative process to a single point in your data set or a single step (by eliminating some loops), does the error still occur? If not, that might suggest that the bounds of the loops you removed are wrong.

Does the erroneous output always occur on particular loop indices, especially the first or last? Perhaps there are some ghost or halo values you're missing or some boundary condition that you're not properly accounting for.
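One way to build such a reduced test case is sketched below, assuming the routine under test can be wrapped behind a simple "n inputs, n outputs" signature. Everything here is hypothetical (the `kernel_fn` type, `same_output_twice`, and the two example kernels stand in for the real code). The harness calls the routine twice on identical input, with the output buffers pre-filled with different byte patterns, and compares the results bitwise. A routine that depends on the prior contents of its output buffer will then disagree with itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature for the routine under test. */
typedef void (*kernel_fn)(const double *in, double *out, size_t n);

/* Returns 1 if two calls on the same input give bitwise identical
 * output, 0 if they differ, -1 if a buffer could not be allocated. */
int same_output_twice(kernel_fn f, const double *in, size_t n)
{
    assert(f != NULL && in != NULL);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    int result = -1;
    if (a != NULL && b != NULL) {
        /* Different fill patterns: all-zero bytes versus all-0xFF bytes
         * (a NaN for IEEE 754 doubles), so stale contents show up. */
        memset(a, 0x00, n * sizeof *a);
        memset(b, 0xFF, n * sizeof *b);
        f(in, a, n);
        f(in, b, n);
        result = (memcmp(a, b, n * sizeof *a) == 0);
    }
    free(a);
    free(b);
    return result;
}

/* Example kernel that fully overwrites its output. */
void overwrite_kernel(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0 * in[i];
}

/* Example kernel that assumes out[] starts at zero. */
void accumulate_kernel(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] += in[i];
}
```

With these two example kernels, the check is expected to pass for `overwrite_kernel` and to fail for `accumulate_kernel`. That makes it a cheap first filter before a step-by-step debugging session.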


Phil Miller


I think I'm going to do something like that, although it is extremely boring. Bounds should work, as they are identical to the PETSc version and in that version they work fine. Also, bounds are mathematically checked. – user3029623 Nov 25 '13 at 03:32
Sorry, pressed enter by mistake. Moreover, the wrong iterations are not the first or last ones, but some in the middle, with no obvious pattern. As if that were not enough, the error occurs only with a significant amount of data, like hundreds of iterations! I guess that for a problem like mine, detailed step-by-step analysis is the only solution... Maybe I should learn how to use a debugger... – user3029623 Nov 25 '13 at 03:37
