Why are things that are used inside OpenMP parallel blocks not collected by Boehm GC afterwards?

Question

I am using Boehm-GC for garbage collection in my C program. I am trying to parallelize a for loop that works on an array allocated through GC_malloc. Once the loop has finished executing, the array is no longer used anywhere in the program, and a call to GC_gcollect_and_unmap frees it. However, when I parallelize the for loop with OpenMP, the array is never freed after the loop has finished. It is exactly the same program; I only add #pragmas around the loop to parallelize it.

I have compared the assembly code with and without OpenMP parallelization side by side. The array pointer is handled in a similar way in both versions, and I don't see extra pointers being kept anywhere. The only difference is that the for loop is a simple loop within the main function in the serial version, whereas in the parallel version OpenMP creates a new function ##name##._omp_fn and calls it. Is there something I need to do so that Boehm-GC collects the array? It is hard for me to post an MWE because if the program is small enough, Boehm-GC doesn't kick in at all.

Here is a code excerpt without parallelization.

  #include <stdio.h>
  #include <stdlib.h>
  #include <gc.h>

  typedef struct thing {
    float* arr;
    int size;
  } thing;

  int l=10;

  /* Frees the malloc'ed array when the GC finalizes the owning object. */
  static void finalizer(void *obj, void* client_data)
  {
    printf("freeing %p\n", obj);
    thing* object = (thing*)obj;
    free(object->arr);
  }

  static thing* get_randn(void) {
    thing* object = (thing*)GC_malloc(sizeof(thing));
    object->arr=malloc(sizeof(float)*l);
    GC_register_finalizer(object, &finalizer, NULL, NULL, NULL);
    float *arr = object->arr;
    int t_id;
    /* torch_randn is defined elsewhere in the full program */
    for (t_id = 0; t_id<l; t_id++) {
       torch_randn(arr+t_id);
    }
    return object;
  }

With the code above, the object returned by the function is garbage collected. The following is the same code with parallelization.

  #include <stdio.h>
  #include <stdlib.h>
  #include <gc.h>

  typedef struct thing {
    float* arr;
    int size;
  } thing;

  int l=10;

  /* Frees the malloc'ed array when the GC finalizes the owning object. */
  static void finalizer(void *obj, void* client_data)
  {
    printf("freeing %p\n", obj);
    thing* object = (thing*)obj;
    free(object->arr);
  }

  static thing* get_randn(void) {
    thing* object = (thing*)GC_malloc(sizeof(thing));
    object->arr=malloc(sizeof(float)*l);
    GC_register_finalizer(object, &finalizer, NULL, NULL, NULL);
    float *arr = object->arr;
    int t_id;
    /* torch_randn is defined elsewhere in the full program */
    #pragma omp parallel num_threads(10)
    {
     #pragma omp for
     for (t_id = 0; t_id<l; t_id++) {
       torch_randn(arr+t_id);
     }
    }
    return object;
  }

For this code, the object does not get garbage collected. It is difficult to reproduce the problem in isolation through an MWE because the garbage collector doesn't kick in for small programs, but I observe this behavior when I run my full program.

How about a [mcve] that demonstrates the issue? A prose description by itself rarely captures all the relevant details. - John Bollinger, May 10 '19 at 20:44

Gregor Budweiser · Answer 1 · 2019-05-19T17:42:19.690

"It is difficult to reproduce the problem just by itself through an MWE because garbage collector doesn't kick in for small programs, but I am observing this behavior when I run with my full program."

You can force garbage collection by calling GC_gcollect().

Also Boehm-GC definitely does free memory/objects allocated within parallel sections. But there is at least one caveat: OpenMP uses a thread pool internally. This means the threads are not necessarily terminated after the parallel section ends. Those pooled and idle threads may still have references to the objects on the heap.

Consider the following program which runs four threads in parallel and allocates a thousand "objects" per thread:

#define GC_THREADS
#include <assert.h>
#include <stdio.h>
#include <omp.h>
#include <gc.h>

#define N_THREADS 4
#define N 1000

// count of finalized objects per thread
static int counters[N_THREADS];

void finalizer(void *obj, void* client_data)
{
#pragma omp atomic
    counters[*(int*)obj]++;
}

int main(void)
{
    GC_INIT();
    GC_allow_register_threads();

    int i;
    for(i = 0; i < N_THREADS; i++) {
        counters[i] = 0;
    }

    // allocate lots of integers and store the thread id in each one
    // execute N iterations per thread
#pragma omp parallel for num_threads(N_THREADS) schedule(static, N)
    for (i = 0; i < N_THREADS*N; i++)
    {
        // register this OpenMP thread with the collector
        // (returns GC_DUPLICATE if the thread is already registered)
        struct GC_stack_base sb;
        GC_get_stack_base(&sb);
        GC_register_my_thread(&sb);

        int *p;
        p = (int*)GC_MALLOC(sizeof(int));
        GC_REGISTER_FINALIZER(p, &finalizer, NULL, NULL, NULL);
        *p = omp_get_thread_num();
    }

    GC_gcollect();
    for(i = 0; i < N_THREADS; i++) {
        printf("finalized objects in thread %d: %d of %d\n", i, counters[i], N);
    }
    return 0;
}

Example output:

finalized objects in thread 0: 1000 of 1000
finalized objects in thread 1: 999 of 1000
finalized objects in thread 2: 999 of 1000
finalized objects in thread 3: 999 of 1000

The numbers imply that threads 1 to 3 are pooled and still hold the reference to the object of the last iteration. Thread 0 is the main thread, which continues execution and thus loses the reference to the last iteration's object on its stack.

Edit: @maddy: I don't think it has anything to do with registers or compiler optimizations. As a rule of thumb, a compiler may only perform optimizations that are guaranteed to not change the behavior of the program. Admittedly your problem might be a corner case.

According to Wikipedia, Boehm-GC looks for references in the program stack. Depending on how the compiler transforms the OpenMP pragmas into code, it may very well be that the stack frame containing the reference to the heap is still valid when the thread enters the idle state. In that case Boehm-GC by definition cannot finalize the referenced object/memory. But it's hard to reason about this IMHO. You would need a good understanding of what your compiler does with OpenMP pragmas and how exactly Boehm-GC analyses the program stack.

The point is: as soon as you reuse the threads (by running something else with OpenMP), the stacks of the pooled threads will be overwritten and Boehm-GC will be able to reclaim the memory from the previous parallel iteration. In the long run you are not leaking memory.
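
To make the stack-scanning argument more concrete, here is a toy model in C. It is only a sketch and not how Boehm-GC is actually implemented: the "stack" is a plain array of machine words, and the names looks_reachable, simulate_parallel_region and simulate_thread_reuse are made up for this illustration. It shows why a stale word left behind by an idle pooled thread is enough to keep a block alive, and why overwriting that word makes the block collectable again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: NOT Boehm-GC's real implementation.
   roots[0..n) stands in for the words of one thread's stack. */

/* Returns 1 if any root word looks like a pointer into
   [block, block + size). A conservative collector treats such a
   block as live, because it cannot tell a stale value from a real one. */
int looks_reachable(const uintptr_t *roots, size_t n,
                    const void *block, size_t size)
{
    assert(roots != NULL && block != NULL);
    uintptr_t lo = (uintptr_t)block;
    uintptr_t hi = lo + size;
    for (size_t i = 0; i < n; i++) {
        if (roots[i] >= lo && roots[i] < hi)
            return 1;
    }
    return 0;
}

/* Hypothetical stand-in for a worker thread that ran a parallel
   region: the frame is dead afterwards, but the slot that held the
   pointer still contains the old value. */
void simulate_parallel_region(uintptr_t *stack, size_t n, const void *block)
{
    assert(stack != NULL && n > 3);
    stack[3] = (uintptr_t)block;   /* slot index chosen arbitrarily */
}

/* Hypothetical stand-in for reusing the pooled thread for unrelated
   work, which overwrites the old stack slots. */
void simulate_thread_reuse(uintptr_t *stack, size_t n)
{
    assert(stack != NULL);
    for (size_t i = 0; i < n; i++)
        stack[i] = 0;
}
```

In this model, checking a block with looks_reachable on a zeroed stack, then after simulate_parallel_region, then after simulate_thread_reuse, reports unreachable, reachable, and unreachable again. That mirrors the behavior described above, where the last object of each pooled thread survives until the thread does other work.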

Do you think it's because the threads are not necessarily terminated, or because those threads leave pointers to the variables in registers? If the program is small and there isn't a lot of register pressure, gcc won't bother overwriting those registers if it doesn't need them (because gcc is so optimized). The garbage collector can only scan the registers and heap, and if it sees a pointer it assumes that the variable is live; the compiler doesn't let it know that the variable is dead and that the register just holds an old value which didn't get overwritten. - maddy99, May 18 '19 at 17:16
