I created a local array in my OpenCL kernel via LLVM; call it lookuptable, of size [256 x i32]. Later I insert code (again via LLVM) to fill the array with values. My problem is that when I generate code that accesses the array, I cannot seem to correctly obtain a pointer to the element I want. I can index to an element (albeit the wrong one for this purpose) if I use a run-time value, a local variable called flattened:
Value *xs_ys_mul = builder.CreateMul(shifted_x_size, y_size, "xs_ys_mul");
Value *xs_ys_z_mul = builder.CreateMul(xs_ys_mul, z, "xs_ys_z_mul");
Value *xs_y_mul = builder.CreateMul(shifted_x_size, y, "xs_y_mul");
Value *sum_1 = builder.CreateAdd(xs_ys_z_mul, xs_y_mul, "sum_1");
Value *flattened = builder.CreateAdd(sum_1, shifted_x, "FLATTENED");
This is the dimensionally flattened local work-group ID; that part is irrelevant here, though.
This is how the GEP is created (builder is an instance of IRBuilder):
std::vector<llvm::Value *> tmp_args;
tmp_args.push_back(builder.getInt32(0));
tmp_args.push_back(flattened);
Value *table_addr = builder.CreateGEP(M.getNamedGlobal(tablename), tmp_args, tablename+"_IDX");
M in this case is the Module object. The table_addr produced is:
%i32_cllocal_TABLE_IDX = getelementptr [256 x i32] addrspace(3)* @i32_cllocal_TABLE, i32 0, i32 %FLATTENED
However, when I want to fill the table correctly by walking through the indices with a for loop, my code looks like this (omitting the loop structure; "index" is the host-side loop counter, emitted as an i32 constant):
std::vector<llvm::Value *> tmp_args;
tmp_args.push_back(builder.getInt32(0));
tmp_args.push_back(builder.getInt32(index));
Value *table_addr = builder.CreateGEP(M.getNamedGlobal(tablename), tmp_args, tablename+"_IDX");
The dump() of the table_addr in that case is (when index==0):
i32 addrspace(3)* getelementptr inbounds ([256 x i32] addrspace(3)* @i32_cllocal_CRC32_TABLE, i32 0, i32 0)
Which means that further down when I do the store:
store_inst = builder.CreateStore(builder.getInt32(tablevalues[index]), table_addr);
I get this output:
store volatile i32 0, i32 addrspace(3)* getelementptr inbounds ([256 x i32] addrspace(3)* @i32_cllocal_TABLE, i32 0, i32 0), align 4
That doesn't look right; but more importantly, I get a SIGABRT on an assert whenever "index" > 0:
Casting.h:194: typename llvm::cast_retty<To, From>::ret_type llvm::cast(const Y&) [with X = llvm::CompositeType, Y = llvm::Type*]: Assertion `isa<X>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
I'm a bit stuck: I don't understand the difference between indexing the array with an explicit constant versus a value that is calculated at run time. Any insights would be greatly appreciated.
UPDATE: What I ended up doing is the following (the alloca is only done once; I'm including it in this code block for visual purposes, but it actually sits outside the for loop):
std::vector<llvm::Value *> tmp_args;
tmp_args.push_back(builder.getInt32(0));
idxInst = builder.CreateAlloca(builder.getInt32Ty(), 0, "idxvalue");
//----- Inside the loop below --------------------------------------
idxStore = builder.CreateStore(builder.getInt32(index), idxInst);
indexValue = builder.CreateLoad(idxInst, "INDEX_VAL");
tmp_args.push_back(indexValue);
table_addr = builder.CreateGEP(table_ptr, tmp_args, "_IDX_PUT");
tmp_args.pop_back();
// NB: CreateStore takes no name; this string converts to the bool
// isVolatile parameter, hence the "store volatile" in the output below.
store_inst = builder.CreateStore(builder.getInt32(tableValues[index]), table_addr, "_ELEM_STORE");
store_inst->setAlignment(4);
Which emitted this code (for index == 0 and 1):
%idxvalue = alloca i32
store i32 0, i32* %idxvalue
%INDEX_VAL = load i32* %idxvalue
%i32_cllocal_TABLE_IDX_PUT = getelementptr [256 x i32] addrspace(3)* @i32_cllocal_TABLE, i32 0, i32 %INDEX_VAL
store volatile i32 0, i32 addrspace(3)* %i32_cllocal_TABLE_IDX_PUT, align 4
store i32 1, i32* %idxvalue
%INDEX_VAL1 = load i32* %idxvalue
%i32_cllocal_TABLE_IDX_PUT2 = getelementptr [256 x i32] addrspace(3)* @i32_cllocal_TABLE, i32 0, i32 %INDEX_VAL1
store volatile i32 1996959894, i32 addrspace(3)* %i32_cllocal_TABLE_IDX_PUT2, align 4
It looks correct now. It seems like a weird way to emit code, though, since I'm storing and then immediately loading; but I suppose that will get optimized out, or I'll try the mem2reg pass. Thanks for the help @Oak.