
I recently started learning how to use CUDA unified memory. The weird thing is that the kernel reports a different result when I replace a pointer member with a non-pointer one.

Please refer to Core.cuh and main.cu.

ClassManaged.h provides the base class that overloads new and delete, and CMakeLists.txt builds the test case.

// ClassManaged.h: overloads the new and delete operators for child classes

#ifndef __CLASS_MANAGED_H__
#define __CLASS_MANAGED_H__

#include <cuda_runtime_api.h>
#include <cstdio>    // printf

class Managed
{
public:
    void *operator new(size_t len)
    {
        printf("-->Managed call!\n");

        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();

        printf("  Address for Managed constructor: %p\n", ptr);

        return ptr;
    }

    void operator delete(void *ptr)
    {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }

    void* operator new[] (size_t len) 
    {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }
    
    void operator delete[] (void* ptr) 
    {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};
#endif
// Core.cuh: where the bug appears
#ifndef __CORE_CUH__
#define __CORE_CUH__

#include "ClassManaged.h"
#include "cuda_runtime.h"
#include <string>
#include <cstdio>

class Box : public Managed{
  public:
    int a;
    int b;
};

class Core : public Managed{
    public:
        __host__ __device__     Core(int cnumin)
        {
            c_num = cnumin;
        }
        __host__ __device__     ~Core() 
        {
            cudaFree(datan);
        }
        void    initialize()
        {
            cudaMallocManaged((void**)&datan, offset * c_num * sizeof(int));
            //cudaMallocManaged((void**)&box, sizeof(Box));   // Test case 1: pointer object
            //box = new Box();                                 // Test case 1: pointer object
        }

    public:
        //Box* box;        // Test case 1: pointer object (everything is ok!)
        Box  box;          // Test case 2: non-pointer object (with the BUG)
        int* datan;


    public:
        int             m_id = 0;            
        int             c_num;     
        int             support_num = 0;      
        const int       offset = 12;      
        float           delta = 1.2;     

};


// A minimal version for kernel

__global__ void WorkFlow_kernel_forcore(Core* core)
{
    volatile int coreno = blockIdx.x;
    if(threadIdx.x == 0) 
    {
        printf("\n--->Kernel data!\n");
        printf("  Core address in kernel: %p\n", core);
        printf("  Box address in kernel: %p\n", &(core->box));
        //printf("  Box address in kernel: %p\n", core->box);
        printf("\n  Addr m_id: %p\n", &(core->m_id));               
        printf("  Addr c_num: %p\n", &(core->c_num));             
        printf("  Addr support_num: %p\n", &(core->support_num)); 
        printf("  Addr Offset: %p\n", &(core->offset));           
        printf("  Addr Delta: %p\n", &(core->delta));             

        printf("\n  Val m_id: %d\n", core->m_id);               
        printf("  Val c_num: %d\n", core->c_num);             
        printf("  Val support_num: %d\n", core->support_num); 
        printf("  Val Offset: %d\n", core->offset);           
        printf("  Val Delta: %.5f\n", core->delta);    
    }

    // The kernel outputs the wrong result for non-pointer Core::box.
}

//main.cu
#include <cuda_runtime.h>
#include "Core.cuh"


int main()
{
    // 1 Only Core involved
    // This is a minimal version suggested by Sebastian (only Core and the kernel are involved here).

    Core* core = new Core(20);   // Core still inherits from Managed here, because it seems more convenient to construct the object with the overloaded new and delete.
    
    core->initialize();

    printf("  Double check core address: %p\n", core);
    printf("  Double check box address: %p\n", &(core->box));
    //printf("  Double check box address: %p\n", core->box);
    printf("\n  Double check Addr m_id: %p\n", &(core->m_id));               
    printf("  Double check Addr c_num: %p\n", &(core->c_num));             
    printf("  Double check Addr support_num: %p\n", &(core->support_num)); 
    printf("  Double check Addr Offset: %p\n", &(core->offset));           
    printf("  Double check Addr Delta: %p\n", &(core->delta));

    WorkFlow_kernel_forcore<<<1,1>>>(core);  // The output is wrong when the non-pointer Core::box is defined!

    delete core;

    // ----------------------------------Wrong result address output
    // -->Managed call!
    //     Address for Managed constructor: 0000000A00000000
    //     Double check core address: 0000000A00000000
    //     Double check box address: 0000000000000000

    //     Double check Addr m_id: 0000000A00000010
    //     Double check Addr c_num: 0000000A00000014
    //     Double check Addr support_num: 0000000A00000018
    //     Double check Addr Offset: 0000000A0000001C
    //     Double check Addr Delta: 0000000A00000020

    // --->Kernel data!
    //     Core address in kernel: 0000000A00000000
    //     Box address in kernel: 0000000A00000004

    //     Addr m_id: 0000000A00000018
    //     Addr c_num: 0000000A0000001C
    //     Addr support_num: 0000000A00000020
    //     Addr Offset: 0000000A00000024
    //     Addr Delta: 0000000A00000028

    //     Val m_id: 0
    //     Val c_num: 12
    //     Val support_num: 1067030938
    //     Val Offset: 0
    //     Val Delta: 0.00000


    // ----------------------------------Correct result address output
    // -->Managed call!
    //     Address for Managed constructor: 0000000A00000000
    // -->Managed call!
    //     Address for Managed constructor: 0000000A00030000

    //     Double check core address: 0000000A00000000
    //     Double check box address: 0000000A00030000

    //     Double check Addr m_id: 0000000A00000010
    //     Double check Addr c_num: 0000000A00000014
    //     Double check Addr support_num: 0000000A00000018
    //     Double check Addr Offset: 0000000A0000001C
    //     Double check Addr Delta: 0000000A00000020

    // --->Kernel data!
    //     Core address in kernel: 0000000A00000000
    //     Box address in kernel: 0000000A00030000

    //     Addr m_id: 0000000A00000010
    //     Addr c_num: 0000000A00000014
    //     Addr support_num: 0000000A00000018
    //     Addr Offset: 0000000A0000001C
    //     Addr Delta: 0000000A00000020

    //     Val m_id: 0
    //     Val c_num: 20
    //     Val support_num: 0
    //     Val Offset: 12
    //     Val Delta: 1.20000


    // 2 This version replaces the unified memory for core with cudaMalloc and cudaMemcpy.
    // NOTE: Before running test 2, please cancel the inheritance from Managed:
    // class Core /*: public Managed*/ {

    //Core* host_core = new Core(20);
    //Core* device_core;

    //cudaMalloc(&device_core, sizeof(Core));
    //cudaMemcpy(device_core, host_core, sizeof(Core), cudaMemcpyHostToDevice);
    //WorkFlow_kernel_forcore<<<1,1>>>(device_core);
    
    // !!!---> This kernel outputs the correct information: 0, 20, 0, 12, 1.2

    //delete host_core;
    //cudaFree(device_core);
    return 0;
}
# CMakeLists.txt
CMAKE_MINIMUM_REQUIRED(VERSION 3.10 FATAL_ERROR)
project (gputask CXX CUDA)


if (MSVC)
    set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ON)
endif (MSVC)



if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
  set(CMAKE_INSTALL_PREFIX "${CMAKE_SOURCE_DIR}/gputask" CACHE PATH "This is default path" FORCE)
endif()



SET(CMAKE_SKIP_BUILD_RPATH FALSE)
SET(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)
SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)


option(ENABLE_EMBED_CUDA "Enable embedding of the CUDA libraries into lib" OFF)


set(GPUTASK_NVCC_ARCHS_DEFAULT "")
list(APPEND GPUTASK_NVCC_ARCHS_DEFAULT 75)
set(GPUTASK_NVCC_ARCHS ${GPUTASK_NVCC_ARCHS_DEFAULT} CACHE STRING "The SM architectures to build code for.")

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcudafe --diag_suppress=code_is_unreachable")

if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
  message(STATUS "Setting build type to 'Release' as none was specified.")
  set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE)
  set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release"
    "MinSizeRel" "RelWithDebInfo")
endif()


set(CMAKE_CXX_STANDARD 14)
SET(CMAKE_CUDA_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

if(CMAKE_COMPILER_IS_GNUCXX OR CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wno-unknown-pragmas -Wno-deprecated-declarations -DMPM_CODE")
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wall -Wno-unknown-pragmas")
endif()

set(CUDA_ARCH_LIST 70 75 CACHE STRING "List of target sm_ architectures to compile CUDA code for. Separate with semicolons.")


foreach(_cuda_arch ${CUDA_ARCH_LIST})
    set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_${_cuda_arch},code=sm_${_cuda_arch}")
endforeach (_cuda_arch)


if (ENABLE_EMBED_CUDA)
    get_filename_component(_cuda_libdir ${CUDA_CUDART_LIBRARY} PATH)
    FILE(GLOB _cuda_libs ${_cuda_libdir}/libcurand.* ${_cuda_libdir}/libcufft.* ${_cuda_libdir}/libcusolver.* ${_cuda_libdir}/libcusparse.*)
    install(PROGRAMS ${_cuda_libs} DESTINATION ${CMAKE_INSTALL_PREFIX}/lib)
endif ()

set(GPUTASK_COMMON_LIBS ${ADDITIONAL_LIBS})
list(APPEND GPUTASK_COMMON_LIBS ${CUDA_LIBRARIES} ${CUDA_cufft_LIBRARY} ${CUDA_curand_LIBRARY})

if (ENABLE_NVTOOLS)
    list(APPEND GPUTASK_COMMON_LIBS ${CUDA_nvToolsExt_LIBRARY})
endif()

include_directories(${CUDA_INCLUDE})

exec_program("date +%x" OUTPUT_VARIABLE COMPILE_DATE)

set(CUDA_VERBOSE_BUILD on)


set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -DNVCC -ftz=true")

set(GPUTASK_ROOT "${CMAKE_SOURCE_DIR}")


ADD_EXECUTABLE(mytask ${CMAKE_CURRENT_SOURCE_DIR}/main.cu)

INSTALL(TARGETS mytask DESTINATION ${CMAKE_INSTALL_PREFIX}/bin)

The kernel prints different information for class Core depending on whether box is a pointer or a non-pointer member; see the printf code block.

Platform information:

OS: Win 10

CUDA: 11.1.74, running on an RTX 2060

Win SDK 10.0.18362.0

MSVC 19.28.29334.0

Visual Studio 16 2019

In brief, it seems that the wrong output in test 1 of main.cu results from class Core : public Managed (the overloaded unified-memory new and delete).

The revised code prints the addresses of all the members of Core both at the cudaMallocManaged stage and inside the kernel.

It is clear that in the buggy version the address of box differs once the kernel is called (the box offset jumps from 0 to 4).

There is no such shift in the correct version. Can it be deduced that the box address is being shifted from one place to another?

Does this mean an out-of-bounds access or a memory leak? (That is my guess, but I am not sure.)

SOLVED------------------------->!!!!

Thanks to Robert, I found the reason for this bug. Please refer to the NVIDIA documentation.

The CUDA compiler follows the IA64 ABI for class layout, while the Microsoft host compiler does not. Let T denote a pointer to member type, or a class type that satisfies any of the following conditions:

T has virtual functions.

T has a virtual base class.

T has multiple inheritance with more than one direct or indirect empty base class.

All direct and indirect base classes B of T are empty and the type of the first field F of T uses B in its definition, such that B is laid out at offset 0 in the definition of F.

Since both Box and Core are children of Managed, placing box as the first field makes the code match the fourth case, "All direct and indirect base classes B of T are empty...".

Undefined behaviour can then appear on Windows because the CUDA compiler follows the IA64 ABI while the host compiler follows the x64 ABI.
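
For reference, here is a minimal sketch of the fix Robert suggested in the comments: reversing the order of box and datan so that the Managed-derived Box is no longer the first field of Core. It assumes the same Managed and Box definitions as above; initialize(), the destructor, and everything else in Core stay unchanged.

// Core.cuh (fixed member order) -- sketch only, member functions omitted
class Core : public Managed {
    public:
        __host__ __device__ Core(int cnumin) { c_num = cnumin; }

    public:
        int* datan;        // plain pointer comes first now
        Box  box;          // Managed-derived member is no longer the first field

    public:
        int             m_id = 0;
        int             c_num;
        int             support_num = 0;
        const int       offset = 12;
        float           delta = 1.2;
};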

I greatly appreciate your suggestions! Thank you very much!

Mangoccc
  • There is much too much code here. Please create a [mcve] and I emphasise the word "minimal". – AdrianHHH Dec 20 '21 at 20:34
  • Thank you! I tried to revise it as minimal as possible. – Mangoccc Dec 20 '21 at 21:23
  • When I run your posted code, as-is, I get the "correct" output: 0, 0, 20, 0, 12, 1.2. So I am unable to see whatever issue you are having. I get the same output whether I use `Box box;` or `Box *box;`. Furthermore, `compute-sanitizer` reports no error in either case. You don't seem to indicate which GPU you are running on. – Robert Crovella Dec 20 '21 at 23:11
  • Thank you! Maybe the bug is not reproducible in your PC. I have run it so much times on my PC and I can get the wrong one each time (case 2). – Mangoccc Dec 20 '21 at 23:15
  • Thanks! I added GPU info to the post (RTX 2060) – Mangoccc Dec 20 '21 at 23:18
  • The output for case 2 may indicate that the memory shifts at a fixed length, say the output of 1067030938 should be delta (1.2 with float format but printed as integer). However, I fail to infer why the memory moves. – Mangoccc Dec 20 '21 at 23:28
  • You have undefined behaviour in your kernel with that use of __syncthreads() but that isn't the likely source of your problem – talonmies Dec 21 '21 at 02:05
  • There you go. It seems that comment __sync or not make no difference. – Mangoccc Dec 21 '21 at 02:12
  • Could you try to further reduce your code? E.g. directly calling just 1 thread without the Task class, using 1 Core instead of a core list as parameter, You can also allocate the core class manually with cudaMallocManaged instead of deriving from Managed. The abstractions are good C++, but for finding this bug, the code should be reduced to the bare minimum. – Sebastian Dec 21 '21 at 08:21
  • Thanks! I have posted a minimal kernel which gets the Core directly and runs with blocks and threads <<<1,1>>>. Unfortunately, I got the same wrong result. Maybe the inheritance from the base class Managed is the key. However, the problem may be that the constructor on the device depends on the overload of new and delete to my knowledge (as a newbie). I will try another way to avoid the overload of new, e.g., try to use cudaMalloc and cudaMemcpy to reproduce the result. – Mangoccc Dec 21 '21 at 09:28
  • Either getting the same wrong result or not is both good! That closes in on the actual error. – Sebastian Dec 21 '21 at 15:20
  • As I understand (by your comments in the code), the version with cudaMalloc/cudaMemcpy gives the correct result. Could you please replace the Managed class by manual cudaMallocManaged as an intermediate step? And check (e.g. by printing out to the console) whether the new/delete of your Managed class is used at all. Could you also check the memory addresses (e.g. printing out a pointer) at the position of cudaMallocManaged and at the invocation of the kernel, it could be that your class was copied in between and something went wrong at that point. – Sebastian Dec 21 '21 at 15:26
  • Thank you! I will have a try on the manual Managed class. – Mangoccc Dec 21 '21 at 15:35
  • Hello guys! There is a new finding about the address change when the kernel takes over the variables. Details have been posted in the question. You may refer to the results output from the buggy and correct versions, including the addresses and values of all the members of Core. Maybe it is the last step to explain why the box address changes when it exists as a non-pointer object? Thank you! – Mangoccc Dec 21 '21 at 16:32
  • You may be running into [this](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#windows-specific). Try reversing the order of `Box box;` and `int* datan;` – Robert Crovella Dec 21 '21 at 18:05
  • Amazing! Thank you Robert! The document points out the reason for my bug. The order of the variables really does matter! After reversing the order of box and datan, the kernel outputs the correct answer now! The question has been solved! Thank you once again! – Mangoccc Dec 22 '21 at 08:23
  • Great! Can the layout be secured with static asserts for the relative offset of each member? That would as best practice safeguard against this bug – Sebastian Dec 22 '21 at 08:56
  • Sorry Sebastian. I may not be able to answer your question about the static assert for a relative offset. I do not know too much about the detailed memory management or the offsets. As a freshman, the low-level things may be beyond my ability. – Mangoccc Dec 22 '21 at 09:01
  • I found a solution with offsetof. https://stackoverflow.com/questions/39907697/can-you-write-a-static-assert-to-verify-the-offset-of-data-members/39907754 https://en.cppreference.com/w/cpp/types/offsetof I would recommend those asserts to check at compile time that the class is compiled and layouted the same way in both compilers. It is a bit more work, but helps to prevent this error, which could lead to hard-to-find runtime bugs. – Sebastian Dec 22 '21 at 09:23
  • Thank you for your advice. I will try it in my code. Actually, my whole code has about 50 files that originally ran on Linux. My tutor asked me to finish a Windows version and the bug emerged. Really a lot of work to do, with some hard-to-find bugs as you said. – Mangoccc Dec 22 '21 at 09:38

1 Answer


The question shows that the bug appears when Core is created with cudaMallocManaged, whereas for a Core created with cudaMalloc and cudaMemcpy the kernel gives the correct answer.

This bug relates to the CUDA documentation.

In detail, the CUDA documentation states:

The CUDA compiler follows the IA64 ABI for class layout, while the Microsoft host compiler does not. Let T denote a pointer to member type, or a class type that satisfies any of the following conditions:

T has virtual functions.

T has a virtual base class.

T has multiple inheritance with more than one direct or indirect empty base class.

All direct and indirect base classes B of T are empty and the type of the first field F of T uses B in its definition, such that B is laid out at offset 0 in the definition of F.

Let C denote T or a class type that has T as a field type or as a base class type. The CUDA compiler may compute the class layout and size differently than the Microsoft host compiler for the type C. As long as the type C is used exclusively in host or device code, the program should work correctly.

Passing an object of type C between host and device code has undefined behavior e.g., as an argument to a global function or through cudaMemcpy*() calls.

Both Box and Core are children of Managed (an empty class that overloads the new and delete operators).

If we place box (a non-pointer object) as the first field of Core, we meet the fourth case: "All direct and indirect base classes B of T are empty and the type of the first field F of T uses B in its definition."

Due to the different ABIs on the Windows host (x64) and in the CUDA compiler (IA64), the kernel exhibits undefined behaviour.

-------------> Personal analysis

The CUDA documentation also notes that the undefined behaviour of the kernel is associated with a class that is created on the host but used on the device, and vice versa.

In other words, a Core created with cudaMalloc may avoid the bug because it is created and used in a consistent environment (both on the host or both on the device).

The same applies to box as a pointer object, which eliminates the bug by avoiding the fourth case (a child of an empty base class located in the first field).
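
As a compile-time safeguard, Sebastian's suggestion in the comments can be applied: pin the layout down with offsetof-based static_asserts compiled in both a host (.cpp) and a device (.cu) translation unit. This is only a sketch; the numeric offsets below are illustrative placeholders, not values taken from either compiler, and should be replaced with the offsets your host build reports.

// layout_check.h -- sketch only; assumes the Core definition from Core.cuh is included,
// and the expected offset/size values are placeholders to be filled in once.
#include <cstddef>   // offsetof

static_assert(offsetof(Core, datan) ==  8, "Core::datan offset mismatch");
static_assert(offsetof(Core, box)   == 16, "Core::box offset mismatch");
static_assert(offsetof(Core, m_id)  == 24, "Core::m_id offset mismatch");
static_assert(offsetof(Core, delta) == 40, "Core::delta offset mismatch");
static_assert(sizeof(Core)          == 48, "Core size mismatch");

If nvcc (IA64 class layout) and MSVC (x64 class layout) ever disagree on the layout of Core, one of these asserts fails at compile time instead of producing silently wrong kernel reads.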

Mangoccc
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – MD. RAKIB HASAN Dec 22 '21 at 09:16